To Sabine


Preface 


When I worked on my Introduction to Multiple Time Series Analysis (Lütkepohl
(1991)), a suitable textbook for this field was not available. Given the
great  importance  these  methods  have  gained  in  applied  econometric  work,  it 
is  perhaps  not  surprising  in  retrospect  that  the  book  was  quite  successful. 
Now,  almost  one  and  a  half  decades  later  the  field  has  undergone  substantial 
development  and,  therefore,  the  book  does  not  cover  all  topics  of  my  own 
courses  on  the  subject  anymore.  Therefore,  I  started  to  think  about  a  serious 
revision  of  the  book  when  I  moved  to  the  European  University  Institute  in 
Florence in 2002. Here in the lovely hills of Tuscany I had the time to think
about  bigger  projects  again  and  decided  to  prepare  a  substantial  revision  of 
my  previous  book.  Because  the  label  Second  Edition  was  already  used  for  a 
previous  reprint  of  the  book,  I  decided  to  modify  the  title  and  thereby  hope 
to  signal  to  potential  readers  that  significant  changes  have  been  made  relative 
to  my  previous  multiple  time  series  book. 

Although Chapters 1-5 still contain an introduction to the vector autoregressive
(VAR) methodology and their structure is largely the same as in
Lütkepohl (1991), there have been some adjustments and additions, partly
in  response  to  feedback  from  students  and  colleagues.  In  particular,  some 
discussion  on  multi-step  causality  and  also  bootstrap  inference  for  impulse 
responses has been added. Moreover, the LM test for residual autocorrelation
is now presented in addition to the portmanteau test and Chow tests for
structural  change  are  discussed  on  top  of  the  previously  considered  prediction 
tests.  When  I  wrote  my  first  book  on  multiple  time  series,  the  cointegration 
revolution  had  just  started.  Hence,  only  one  chapter  was  devoted  to  the  topic. 
By  now  the  related  models  and  methods  have  become  far  more  important  for 
applied econometric work than, for example, vector autoregressive moving average
(VARMA) models. Therefore, Part II (Chapters 6-8) is now entirely devoted
to VAR models with cointegrated variables. The basic framework in this
new  part  is  the  vector  error  correction  model  (VECM).  Chapter  9  is  also  new. 
It  contains  a  discussion  of  structural  vector  autoregressive  and  vector  error 
correction  models  which  are  by  now  also  standard  tools  in  applied  econometric 
analysis. Chapter 10 on systems of dynamic simultaneous equations maintains
much of the contents of the corresponding chapter in Lütkepohl (1991). Some
discussion of nonstationary, integrated series has been added, however. Chapters
9 and 10 together constitute Part III. Given that the research activities
devoted to VARMA models have been less important than those on cointegration,
I have shifted them to Part IV (Chapters 11-15) of the new book. This
part also contains a new chapter on cointegrated VARMA models (Chapter
14) and in Chapter 15 on infinite order VAR models, a section on models
with  cointegrated  variables  has  been  added.  The  last  part  of  the  new  book 
contains  three  chapters  on  special  topics  related  to  multiple  time  series.  One 
chapter  deals  with  autoregressive  conditional  heteroskedasticity  (Chapter  16) 
and  is  new,  whereas  the  other  two  chapters  on  periodic  models  (Chapter  17) 
and state space models (Chapter 18) are largely taken from Lütkepohl (1991).
All chapters have been adjusted to account for the new material and the new
structure of the book. In some instances, the notation has also been modified.
In  Appendix  A,  some  additional  matrix  results  are  presented  because  they 
are  used  in  the  new  parts  of  the  text.  Also  Appendix  C  has  been  expanded 
by  sections  on  unit  root  asymptotics.  These  results  are  important  in  the  more 
extensive  discussion  of  cointegration.  Moreover,  the  discussion  of  bootstrap 
methods  in  Appendix  D  has  been  revised.  Generally,  I  have  added  many  new 
references  and  consequently  the  reference  list  is  now  much  longer  than  in  the 
previous  version.  To  keep  the  length  of  the  book  in  acceptable  bounds,  I  have 
also deleted some material from the previous version. For example, stationary
reduced rank VAR models are just mentioned as examples of models with
nonlinear  parameter  restrictions  and  not  discussed  in  detail  anymore.  Reduced 
rank  models  are  now  more  important  in  the  context  of  cointegration  analysis. 
Also  the  tables  with  example  time  series  are  not  timely  anymore  and  have 
been  eliminated.  The  example  time  series  are  available  from  my  webpage  and 
they  can  also  be  downloaded  from  www.jmulti.de.  It  is  my  hope  that  these 
revisions  make  the  book  more  suitable  for  a  modern  course  on  multiple  time 
series  analysis. 

Although  multiple  time  series  analysis  is  applied  in  many  disciplines,  I  have 
prepared the text with economics and business students in mind. The examples
and exercises are chosen accordingly. Despite this orientation, I hope that
the  book  will  also  serve  multiple  time  series  courses  in  other  fields.  It  contains 
enough  material  for  a  one  semester  course  on  multiple  time  series  analysis.  It 
may also be combined with univariate time series books or with texts like
Fuller  (1976)  or  Hamilton  (1994)  to  form  the  basis  of  a  one  or  two  semester 
course  on  univariate  and  multivariate  time  series  analysis.  Alternatively,  it  is 
also  possible  to  select  some  of  the  chapters  or  sections  for  a  special  topic  of  a 
graduate  level  econometrics  course.  For  example,  Chapters  1-8  could  be  used 
for  an  introduction  to  stationary  and  cointegrated  VARs.  For  students  already 
familiar  with  these  topics,  Chapter  9  could  be  a  special  topic  on  structural 
VAR modelling in an advanced econometrics course.

The students using the book must have knowledge of matrix algebra and
should  also  have  been  introduced  to  mathematical  statistics,  for  instance, 
based  on  textbooks  like  Mood,  Graybill  &  Boes  (1974),  Hogg  &  Craig  (1978) 
or Rohatgi (1976). Moreover, a working knowledge of the Box-Jenkins approach
and other univariate time series techniques is an advantage. Although,
in  principle,  it  may  be  possible  to  use  the  present  text  without  any  prior 
knowledge  of  univariate  time  series  analysis  if  the  instructor  provides  the 
required motivation, it is clearly an advantage to have some time series background.
Also, a previous introduction to econometrics will be helpful. Matrix
algebra  and  an  introductory  mathematical  statistics  course  plus  the  multiple 
regression  model  are  necessary  prerequisites. 

Like the previous book, the present one is meant to be an introductory
exposition.  Hence,  I  am  not  striving  for  utmost  generality.  For  instance,  quite 
often  I  use  the  normality  assumption  although  the  considered  results  hold 
under  more  general  conditions.  The  emphasis  is  on  explaining  the  underlying 
ideas  and  not  on  generality.  In  Chapters  2-7  a  number  of  results  are  proven 
to  illustrate  some  of  the  techniques  that  are  often  used  in  the  multiple  time 
series  arena.  Most  proofs  may  be  skipped  without  loss  of  continuity.  Therefore 
the  beginning  and  the  end  of  a  proof  are  usually  clearly  marked.  Many  results 
are  summarized  in  propositions  for  easy  reference. 

Exercises are given at the end of each chapter with the exception of Chapter
1. Some of the problems may be too difficult for students without a good
formal training; some are just included to avoid details of proofs given in the
text.  In  most  chapters  empirical  exercises  are  provided  in  addition  to  algebraic 
problems. Solving the empirical problems requires the use of a computer. Matrix
oriented software such as GAUSS, MATLAB, or Ox will be most helpful.
Most of the empirical exercises can also be done with the easy-to-use software
JMulTi (see Lütkepohl & Krätzig (2004)) which is available free of charge at
the  website  www.jmulti.de.  The  data  needed  for  the  exercises  are  also  available 
at  that  website,  as  mentioned  earlier. 

Many  persons  have  contributed  directly  or  indirectly  to  this  book  and  I  am 
very  grateful  to  all  of  them.  Many  students  and  colleagues  have  commented 
on  my  earlier  book  on  the  topic.  Thereby  they  have  helped  to  improve  the 
presentation  and  to  correct  errors.  A  number  of  colleagues  have  commented 
on  parts  of  the  manuscript  and  have  been  available  for  discussions  on  the 
topics  covered.  These  comments  and  discussions  have  been  very  helpful  for 
my  own  understanding  of  the  subject  and  have  resulted  in  improvements  to 
the  manuscript. 

Although  the  persons  who  have  contributed  to  the  project  in  some  way  or 
other are too numerous to be listed here, I wish to express my special gratitude
to some of them. Because some parts of the old book are still maintained,
it is only fair to mention those who have helped in a special way in the preparation
of that book. They include Theo Dykstra who read and commented
on  a  large  part  of  the  manuscript  during  his  visit  in  Kiel  in  the  summer  of 
1990,  Hans-Eggert  Reimers  who  read  the  entire  manuscript,  suggested  many 
improvements, and pointed out numerous errors, Wolfgang Schneider who
helped  with  examples  and  also  commented  on  parts  of  the  manuscript  as  well 
as  Bernd  Theilen  who  prepared  the  final  versions  of  most  figures,  and  Knut 
Haase  and  Holger  Claessen  who  performed  the  computations  for  many  of  the 
examples.  I  deeply  appreciate  the  help  of  all  these  collaborators. 

Special thanks for comments on parts of the new book go to Pentti Saikkonen
for helping with Part II and to Ralf Brüggemann, Helmut Herwartz, and
Martin  Wagner  for  reading  Chapters  9,  16,  and  18,  respectively.  Christian 
Kascha  prepared  some  of  the  new  figures  and  my  wife  Sabine  helped  with 
the  preparation  of  the  author  index.  Of  course,  I  assume  full  responsibility 
for  any  remaining  errors,  in  particular,  as  I  have  keyboarded  large  parts  of 
the manuscript myself. A preliminary LaTeX version of parts of the old book
was  provided  by  Springer-Verlag.  I  thank  Martina  Bihn  for  taking  charge  of 
the  project  on  the  side  of  Springer-Verlag.  Needless  to  say,  I  welcome  any 
comments  by  readers. 


Florence  and  Berlin, 


Helmut Lütkepohl
March  2005 
Introduction


1.1  Objectives  of  Analyzing  Multiple  Time  Series 

In  making  choices  between  alternative  courses  of  action,  decision  makers  at 
all  structural  levels  often  need  predictions  of  economic  variables.  If  time  series 
observations  are  available  for  a  variable  of  interest  and  the  data  from  the 
past  contain  information  about  the  future  development  of  a  variable,  it  is 
plausible  to  use  as  forecast  some  function  of  the  data  collected  in  the  past.  For 
instance,  in  forecasting  the  monthly  unemployment  rate,  from  past  experience 
a  forecaster  may  know  that  in  some  country  or  region  a  high  unemployment 
rate  in  one  month  tends  to  be  followed  by  a  high  rate  in  the  next  month. 
In  other  words,  the  rate  changes  only  gradually.  Assuming  that  the  tendency 
prevails  in  future  periods,  forecasts  can  be  based  on  current  and  past  data. 

Formally, this approach to forecasting may be expressed as follows. Let $y_t$
denote the value of the variable of interest in period $t$. Then a forecast for
period $T + h$, made at the end of period $T$, may have the form

$$\hat{y}_{T+h} = f(y_T, y_{T-1}, \dots), \tag{1.1.1}$$

where $f(\cdot)$ denotes some suitable function of the past observations $y_T, y_{T-1},
\dots$. For the moment it is left open how many past observations enter into
the forecast. One major goal of univariate time series analysis is to specify
sensible forms of functions $f(\cdot)$. In many applications, linear functions have
been used so that, for example,

$$\hat{y}_{T+h} = \nu + \alpha_1 y_T + \alpha_2 y_{T-1} + \cdots.$$

In  dealing  with  economic  variables,  often  the  value  of  one  variable  is  not 
only  related  to  its  predecessors  in  time  but,  in  addition,  it  depends  on  past 
values  of  other  variables.  For  instance,  household  consumption  expenditures 
may depend on variables such as income, interest rates, and investment expenditures.
If all these variables are related to the consumption expenditures
it makes sense to use their possible additional information content in forecasting
consumption expenditures. In other words, denoting the related variables
by $y_{1t}, y_{2t}, \dots, y_{Kt}$, the forecast of $y_{1,T+h}$ at the end of period $T$ may be of
the form

$$\hat{y}_{1,T+h} = f_1(y_{1,T}, y_{2,T}, \dots, y_{K,T}, y_{1,T-1}, y_{2,T-1}, \dots, y_{K,T-1}, y_{1,T-2}, \dots).$$

Similarly, a forecast for the second variable may be based on past values of
all variables in the system. More generally, a forecast of the $k$-th variable may
be expressed as

$$\hat{y}_{k,T+h} = f_k(y_{1,T}, \dots, y_{K,T}, y_{1,T-1}, \dots, y_{K,T-1}, \dots). \tag{1.1.2}$$

A set of time series $y_{kt}$, $k = 1, \dots, K$, $t = 1, \dots, T$, is called a multiple time
series and the previous formula expresses the forecast $\hat{y}_{k,T+h}$ as a function
of a multiple time series. In analogy with the univariate case, it is one major
objective of multiple time series analysis to determine suitable functions
$f_1, \dots, f_K$ that may be used to obtain forecasts with good properties for the
variables of the system.

It  is  also  often  of  interest  to  learn  about  the  dynamic  interrelationships 
between a number of variables. For instance, in a system consisting of investment,
income, and consumption one may want to know about the likely impact
of  a  change  in  income.  What  will  be  the  present  and  future  implications  of 
such  an  event  for  consumption  and  for  investment?  Under  what  conditions 
can the effect of an increase in income be isolated and traced through the system?
Alternatively, given a particular subject matter theory, is it consistent
with  the  relations  implied  by  a  multiple  time  series  model  which  is  developed 
with  the  help  of  statistical  tools?  These  and  other  questions  regarding  the 
structure  of  the  relationships  between  the  variables  involved  are  occasionally 
investigated  in  the  context  of  multiple  time  series  analysis.  Thus,  obtaining 
insight  into  the  dynamic  structure  of  a  system  is  a  further  objective  of  multiple 
time  series  analysis. 


1.2  Some  Basics 

In  the  following  chapters,  we  will  regard  the  values  that  a  particular  economic 
variable  has  assumed  in  a  specific  period  as  realizations  of  random  variables.  A 
time  series  will  be  assumed  to  be  generated  by  a  stochastic  process.  Although 
the  reader  is  assumed  to  be  familiar  with  these  terms,  it  may  be  useful  to 
briefly  review  some  of  the  basic  definitions  and  expressions  at  this  point,  in 
order  to  make  the  underlying  concepts  precise. 

Let $(\Omega, \mathcal{F}, \Pr)$ be a probability space, where $\Omega$ is the set of all elementary
events (sample space), $\mathcal{F}$ is a sigma-algebra of events or subsets of $\Omega$ and $\Pr$
is a probability measure defined on $\mathcal{F}$. A random variable $y$ is a real valued
function defined on $\Omega$ such that for each real number $c$,
$A_c = \{\omega \in \Omega \mid y(\omega) \le c\} \in \mathcal{F}$. In other words, $A_c$ is an event for which the probability is defined in
terms of $\Pr$. The function $F : \mathbb{R} \to [0, 1]$, defined by $F(c) = \Pr(A_c)$, is the
distribution function of $y$.

A $K$-dimensional random vector or a $K$-dimensional vector of random
variables is a function $y$ from $\Omega$ into the $K$-dimensional Euclidean space $\mathbb{R}^K$,
that is, $y$ maps $\omega \in \Omega$ on $y(\omega) = (y_1(\omega), \dots, y_K(\omega))'$ such that for each
$c = (c_1, \dots, c_K)' \in \mathbb{R}^K$,

$$A_c = \{\omega \mid y_1(\omega) \le c_1, \dots, y_K(\omega) \le c_K\} \in \mathcal{F}.$$

The function $F : \mathbb{R}^K \to [0, 1]$ defined by $F(c) = \Pr(A_c)$ is the joint distribution
function of $y$.

Suppose $Z$ is some index set with at most countably many elements like, for
instance, the set of all integers or all positive integers. A (discrete) stochastic
process is a real valued function

$$y : Z \times \Omega \to \mathbb{R}$$

such that for each fixed $t \in Z$, $y(t, \omega)$ is a random variable. The random
variable corresponding to a fixed $t$ is usually denoted by $y_t$ in the following.
The underlying probability space will usually not even be mentioned. In that
case, it is understood that all the members $y_t$ of a stochastic process are
defined on the same probability space. Usually the stochastic process will also
be denoted by $y_t$ if the meaning of the symbol is clear from the context.

A stochastic process may be described by the joint distribution functions
of all finite subcollections of $y_t$'s, $t \in S \subset Z$. In practice, the complete system
of distributions will often be unknown. Therefore, in the following chapters, we
will often be concerned with the first and second moments of the distributions.
In other words, we will be concerned with the means $E(y_t) = \mu_t$, the variances
$E[(y_t - \mu_t)^2]$ and the covariances $E[(y_t - \mu_t)(y_s - \mu_s)]$.

A $K$-dimensional vector stochastic process or multivariate stochastic process
is a function

$$y : Z \times \Omega \to \mathbb{R}^K,$$

where, for each fixed $t \in Z$, $y(t, \omega)$ is a $K$-dimensional random vector. Again
we usually use the symbol $y_t$ for the random vector corresponding to a fixed
$t \in Z$. For simplicity, we also often denote the complete process by $y_t$. The particular
meaning of the symbol should be clear from the context. With respect
to the stochastic characteristics the same applies as for univariate processes.
That is, the stochastic characteristics are summarized in the joint distribution
functions of all finite subcollections of random vectors $y_t$. In practice, interest
will often focus on the first and second moments of all random variables
involved.

A realization of a (vector) stochastic process is a sequence (of vectors)
$y_t(\omega)$, $t \in Z$, for a fixed $\omega$. In other words, a realization of a stochastic process
is a function $Z \to \mathbb{R}^K$ where $t \mapsto y_t(\omega)$. A (multiple) time series is regarded
as such a realization or possibly a finite part of such a realization, that is,
it consists, for instance, of values (vectors) $y_1(\omega), \dots, y_T(\omega)$. The underlying
stochastic process is said to have generated the (multiple) time series or it is
called the generating or generation process of the time series or the data generation
process (DGP). A time series $y_1(\omega), \dots, y_T(\omega)$ will usually be denoted
by $y_1, \dots, y_T$ or simply by $y_t$ just like the underlying stochastic process, if no
confusion is possible. The number of observations, $T$, is called the sample size
or time series length. With this terminology at hand, we may now return to
the problem of specifying forecast functions.
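
The distinction between a stochastic process and one of its realizations can be illustrated with a small sketch. Here a random seed stands in for the elementary event $\omega$: fixing the seed fixes the whole path. The function name `realization`, the use of a seed as $\omega$, and the independent standard normal components are assumptions made purely for illustration and are not taken from the text.

```python
import random

def realization(omega, T, K=2):
    """Return a finite part y_1(omega), ..., y_T(omega) of a K-dimensional process.

    The seed `omega` plays the role of the elementary event. Independent
    standard normal components are an illustrative assumption only.
    """
    rng = random.Random(omega)
    return [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(T)]

# The same omega reproduces the same time series; another omega gives another one.
series_a = realization(omega=1, T=5)
series_b = realization(omega=1, T=5)
series_c = realization(omega=2, T=5)
print(series_a == series_b)             # same elementary event, same path
print(series_a == series_c)             # different elementary event
print(len(series_a), len(series_a[0]))  # sample size T and dimension K
```

In this picture the observed multiple time series is a single list such as `series_a`, while the process is the whole rule that maps each seed to a path.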


1.3  Vector  Autoregressive  Processes 

Because  linear  functions  are  relatively  easy  to  deal  with,  it  makes  sense  to 
begin  with  forecasts  that  are  linear  functions  of  past  observations.  Let  us 
consider a univariate time series $y_t$ and a forecast $h = 1$ period into the
future. If $f(\cdot)$ in (1.1.1) is a linear function, we have

$$\hat{y}_{T+1} = \nu + \alpha_1 y_T + \alpha_2 y_{T-1} + \cdots.$$

Assuming that only a finite number $p$, say, of past $y$ values are used in the
prediction formula, we get

$$\hat{y}_{T+1} = \nu + \alpha_1 y_T + \alpha_2 y_{T-1} + \cdots + \alpha_p y_{T-p+1}. \tag{1.3.1}$$

Of course, the true value $y_{T+1}$ will usually not be exactly equal to the forecast
$\hat{y}_{T+1}$. Let us denote the forecast error by $u_{T+1} := y_{T+1} - \hat{y}_{T+1}$ so that

$$y_{T+1} = \hat{y}_{T+1} + u_{T+1} = \nu + \alpha_1 y_T + \cdots + \alpha_p y_{T-p+1} + u_{T+1}. \tag{1.3.2}$$

Now, assuming that our numbers are realizations of random variables and
that the same data generation law prevails in each period $T$, (1.3.2) has the
form of an autoregressive process,

$$y_t = \nu + \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + u_t, \tag{1.3.3}$$

where the quantities $y_t, y_{t-1}, \dots, y_{t-p}$, and $u_t$ are now random variables. To
actually get an autoregressive (AR) process we assume that the forecast errors
$u_t$ for different periods are uncorrelated, that is, $u_t$ and $u_s$ are uncorrelated
for $s \neq t$. In other words, we assume that all useful information in the past
$y_t$'s is used in the forecasts so that there are no systematic forecast errors.
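
As a minimal sketch of the prediction formula (1.3.1) and the forecast error in (1.3.2), the following computes a one-step forecast from the last $p$ observations. The function name `ar_forecast` and all numerical values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def ar_forecast(y, nu, alpha):
    """One-step forecast nu + alpha_1*y_T + ... + alpha_p*y_{T-p+1}, as in (1.3.1).

    y     : observed series y_1, ..., y_T (most recent value last)
    nu    : intercept
    alpha : coefficients [alpha_1, ..., alpha_p]
    """
    p = len(alpha)
    if len(y) < p:
        raise ValueError("need at least p observations")
    # y[-1] is y_T, y[-2] is y_{T-1}, and so on.
    return nu + sum(a * y[-1 - i] for i, a in enumerate(alpha))

# Hypothetical numbers, for illustration only.
y = [5.0, 6.0, 5.5, 6.5]
nu, alpha = 1.0, [0.5, 0.25]
y_hat = ar_forecast(y, nu, alpha)   # 1.0 + 0.5*6.5 + 0.25*5.5
u_next = 7.0 - y_hat                # forecast error if y_{T+1} turned out to be 7.0
print(y_hat, u_next)
```

The forecast error `u_next` is only known once the next observation is available, which is why it is treated as a random variable in (1.3.3).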

If a multiple time series is considered, an obvious extension of (1.3.1) would
be

$$\hat{y}_{k,T+1} = \nu_k + \alpha_{k1,1} y_{1,T} + \alpha_{k2,1} y_{2,T} + \cdots + \alpha_{kK,1} y_{K,T} + \cdots + \alpha_{k1,p} y_{1,T-p+1} + \cdots + \alpha_{kK,p} y_{K,T-p+1}, \quad k = 1, \dots, K. \tag{1.3.4}$$

To simplify the notation, let $y_t := (y_{1t}, \dots, y_{Kt})'$, $\hat{y}_t := (\hat{y}_{1t}, \dots, \hat{y}_{Kt})'$,
$\nu := (\nu_1, \dots, \nu_K)'$ and

$$A_i := \begin{bmatrix} \alpha_{11,i} & \cdots & \alpha_{1K,i} \\ \vdots & \ddots & \vdots \\ \alpha_{K1,i} & \cdots & \alpha_{KK,i} \end{bmatrix}.$$

Then (1.3.4) can be written compactly as

$$\hat{y}_{T+1} = \nu + A_1 y_T + \cdots + A_p y_{T-p+1}. \tag{1.3.5}$$

If the $y_t$'s are regarded as random vectors, this predictor is just the optimal
forecast obtained from a vector autoregressive model of the form

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \tag{1.3.6}$$

where the $u_t = (u_{1t}, \dots, u_{Kt})'$ form a sequence of independently identically
distributed random $K$-vectors with zero mean vector.

Obviously  such  a  model  represents  a  tremendous  simplification  compared 
with  the  general  form  (1.1.2).  Because  of  its  simple  structure,  it  enjoys  great 
popularity in applied work. We will study this particular model in the following
chapters in some detail.
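
The compact predictor (1.3.5) translates directly into a few lines of code. The sketch below uses plain Python lists so that it has no dependencies. The helper names `mat_vec` and `var_forecast` and the bivariate VAR(2) coefficients are hypothetical, chosen for illustration rather than taken from any example in the text.

```python
def mat_vec(A, x):
    """Multiply a (K x K) matrix, given as a list of rows, by a K-vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def var_forecast(y, nu, A):
    """One-step forecast nu + A_1 y_T + ... + A_p y_{T-p+1}, as in (1.3.5).

    y  : list of K-vectors y_1, ..., y_T (most recent last)
    nu : intercept K-vector
    A  : list [A_1, ..., A_p] of (K x K) coefficient matrices
    """
    if len(y) < len(A):
        raise ValueError("need at least p observations")
    forecast = list(nu)
    for i, A_i in enumerate(A):
        lagged = mat_vec(A_i, y[-1 - i])   # A_{i+1} applied to y_{T-i}
        forecast = [f + l for f, l in zip(forecast, lagged)]
    return forecast

# Hypothetical bivariate VAR(2); the numbers are illustrative only.
nu = [1.0, 0.0]
A = [
    [[0.5, 0.0], [0.25, 0.5]],   # A_1
    [[0.0, 0.25], [0.0, 0.0]],   # A_2
]
y = [[2.0, 4.0], [4.0, 2.0]]     # y_{T-1}, y_T
print(var_forecast(y, nu, A))
```

Each component of the forecast is a linear function of the lagged values of all $K$ variables, which is exactly the structure written out element by element in (1.3.4).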


1.4  Outline  of  the  Following  Chapters 

In  Part  I  of  the  book,  consisting  of  the  next  four  chapters,  we  will  investigate 
some  basic  properties  of  stationary  vector  autoregressive  (VAR)  processes  such 
as  (1.3.6).  Forecasts  based  on  these  processes  are  discussed  and  it  is  shown 
how VAR processes may be used for analyzing the dynamic structure of a system
of variables. Throughout Chapter 2, it is assumed that the process under
study  is  completely  known  including  its  coefficient  matrices.  In  practice,  for 
a  given  multiple  time  series,  first  a  model  of  the  DGP  has  to  be  specified 
and  its  parameters  have  to  be  estimated.  Then  the  adequacy  of  the  model 
is  checked  by  various  statistical  tools  and  then  the  estimated  model  can  be 
used  for  forecasting  and  dynamic  or  structural  analysis.  The  main  steps  of 
a  VAR  analysis  are  presented  in  Figure  1.1  in  a  schematic  way.  Estimation 
and  model  specification  are  discussed  in  Chapters  3  and  4,  respectively.  In  the 
former  chapter  the  estimation  of  the  VAR  coefficients  is  considered  and  the 
consequences  of  using  estimated  rather  than  known  processes  for  forecasting 
and  economic  analysis  are  explored.  In  Chapter  4,  the  specification  and  model 
checking stages of an analysis are considered. Criteria for determining the order
p of a VAR process are given and possibilities for checking the assumptions
underlying  a  VAR  analysis  are  discussed. 

Fig. 1.1. VAR analysis.

In systems with many variables and/or large VAR order p, the number
of coefficients is quite substantial. As a result the estimation precision will
be low if estimation is based on time series of the size typically available
in  economic  applications.  In  order  to  improve  the  estimation  precision,  it  is 
useful  to  place  restrictions  from  nonsample  sources  on  the  parameters  and 
thereby  reduce  the  number  of  coefficients  to  be  estimated.  In  Chapter  5,  VAR 
processes  with  parameter  constraints  and  restricted  estimation  are  discussed. 
Zero  restrictions,  nonlinear  constraints,  and  Bayesian  estimation  are  treated. 
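
To give a rough sense of how quickly the number of coefficients grows, the following sketch counts the entries of the intercept vector and the $p$ coefficient matrices of an unrestricted model of the form (1.3.6). The function name `var_coefficient_count` and the chosen values of $K$ and $p$ are illustrative assumptions.

```python
def var_coefficient_count(K, p, intercept=True):
    """Number of coefficients in an unrestricted K-dimensional VAR(p).

    Each of the p coefficient matrices has K*K entries; the intercept
    vector adds K more. Covariance parameters are not counted.
    """
    return p * K * K + (K if intercept else 0)

# Illustrative system sizes only.
for K, p in [(2, 1), (3, 4), (6, 4)]:
    print(K, p, var_coefficient_count(K, p))
```

Because the count grows with the square of $K$, even moderate systems can require more coefficients than a short economic time series can estimate precisely.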

In  Part  I,  stationary  processes  are  considered  which  have  time  invariant 
expected values, variances, and covariances. In other words, the first and second
moments of the random variables do not change over time. In practice
many  time  series  have  a  trending  behavior  which  is  not  compatible  with  such 
an  assumption.  This  fact  is  recognized  in  Part  II,  where  VAR  processes  with 
stochastic  and  deterministic  trends  are  considered.  Processes  with  stochastic 
trends  are  often  called  integrated  and  if  two  or  more  variables  are  driven  by 
the  same  stochastic  trend,  they  are  called  cointegrated.  Cointegrated  VAR 
processes  have  quite  different  properties  from  stationary  ones  and  this  has 
to  be  taken  into  account  in  the  statistical  analysis.  The  specific  estimation, 
specification,  and  model  checking  procedures  are  discussed  in  Chapters  6-8. 

The  models  discussed  in  Parts  I  and  II  are  essentially  reduced  form  models 
which  capture  the  dynamic  properties  of  the  variables  and  are  useful  forecast¬ 
ing  tools.  For  structural  economic  analysis,  these  models  are  often  insufficient 
because different economic theories may be compatible with the same statistical
reduced form model. In Chapter 9, it is discussed how to integrate
structural information in stationary and cointegrated VAR models. In many
econometric applications it is assumed that some of the variables are determined
outside the system under consideration. In other words, they are
exogenous or unmodelled variables. VAR processes with exogenous variables
are considered in Chapter 10. In the econometrics literature such systems
are often called systems of dynamic simultaneous equations. In the time series
literature they are sometimes referred to as multivariate transfer function
models.  Together  Chapters  9  and  10  constitute  Part  III  of  this  volume. 

In  Part  IV  of  the  book,  it  is  recognized  that  an  upper  bound  p  for  the  VAR 
order  is  often  not  known  with  certainty.  In  such  a  case,  one  may  not  want 
to  impose  any  upper  bound  and  allow  for  an  infinite  VAR  order.  There  are 
two  ways  to  make  the  estimation  problem  for  the  potentially  infinite  number 
of  parameters  tractable.  First,  it  may  be  assumed  that  they  depend  on  a 
finite  set  of  parameters.  This  assumption  leads  to  vector  autoregressive  moving 
average  (VARMA)  processes.  Some  properties  of  these  processes,  parameter 
estimation  and  model  specification  are  discussed  in  Chapters  11-13  for  the 
stationary  case  and  in  Chapter  14  for  cointegrated  systems.  In  the  second 
approach  for  dealing  with  infinite  order  VAR  processes,  it  is  assumed  that 
finite  order  VAR  processes  are  fitted  and  that  the  VAR  order  goes  to  infinity 
with  the  sample  size.  This  approach  and  its  consequences  for  the  estimators, 
forecasts,  and  structural  analysis  are  discussed  in  Chapter  15  for  both  the 
stationary  and  the  cointegrated  cases. 

In  Part  V,  some  special  models  and  issues  for  multiple  time  series  are 
studied.  In  Chapter  16,  models  for  conditionally  heteroskedastic  series  are 
considered and, in particular, multivariate generalized autoregressive conditionally
heteroskedastic (MGARCH) processes are presented and analyzed.
In  Chapter  17,  VAR  processes  with  time  varying  coefficients  are  considered. 
The coefficient variability may be due to a one-time intervention from outside
the system or it may result from seasonal variation. Finally, in Chapter
18,  so-called  state  space  models  are  introduced.  The  models  represent  a  very 
general  class  which  encompasses  most  of  the  models  previously  discussed  and 
includes  in  addition  VAR  models  with  stochastically  varying  coefficients.  A 
brief  review  of  these  and  other  important  models  for  multiple  time  series  is 
given.  The  Kalman  filter  is  presented  as  an  important  tool  for  dealing  with 
state  space  models. 

The  reader  is  assumed  to  be  familiar  with  vectors  and  matrices.  The  rules 
used in the text are summarized in Appendix A. Some results on the multivariate
normal and related distributions are listed in Appendix B and stochastic
convergence and some asymptotic distribution theory are reviewed in Appendix
C. In Appendix D, a brief outline is given of the use of simulation
techniques  in  evaluating  properties  of  estimators  and  test  statistics.  Although 
it  is  not  necessary  for  the  reader  to  be  familiar  with  all  the  particular  rules  and 
propositions  listed  in  the  appendices,  it  is  implicitly  assumed  in  the  following 
chapters  that  the  reader  has  knowledge  of  the  basic  terms  and  results. 

In the four chapters of this part, finite order, stationary vector autoregressive
(VAR) processes and their uses are discussed. Chapter 2 is dedicated to
processes with known coefficients. Some of their basic properties are derived and
their use for prediction and analysis purposes is considered. Unconstrained
estimation  is  discussed  in  Chapter  3,  model  specification  and  checking  the 
model  adequacy  are  treated  in  Chapter  4,  and  estimation  with  parameter 
restrictions  is  the  subject  of  Chapter  5. 


2 


Stable  Vector  Autoregressive  Processes 


In  this  chapter,  the  basic,  stationary  finite  order  vector  autoregressive  (VAR) 
model  will  be  introduced.  Some  important  properties  will  be  discussed.  The 
main uses of vector autoregressive models are forecasting and structural
analysis. These two uses will be considered in Sections 2.2 and 2.3. Throughout
this  chapter,  the  model  of  interest  is  assumed  to  be  known.  Although  this 
assumption  is  unrealistic  in  practice,  it  helps  to  see  the  problems  related  to 
VAR  models  without  contamination  by  estimation  and  specification  issues. 
The  latter  two  aspects  of  an  analysis  will  be  treated  in  detail  in  subsequent 
chapters. 


2.1  Basic  Assumptions  and  Properties  of  VAR  Processes 

2.1.1  Stable  VAR(p)  Processes 

The  object  of  interest  in  the  following  is  the  VAR(p)  model  (VAR  model  of 
order  p), 


$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \quad t = 0, \pm 1, \pm 2, \ldots, \tag{2.1.1}$$

where $y_t = (y_{1t}, \ldots, y_{Kt})'$ is a $(K \times 1)$ random vector, the $A_i$ are fixed $(K \times K)$
coefficient matrices, $\nu = (\nu_1, \ldots, \nu_K)'$ is a fixed $(K \times 1)$ vector of intercept
terms allowing for the possibility of a nonzero mean $E(y_t)$. Finally, $u_t =
(u_{1t}, \ldots, u_{Kt})'$ is a $K$-dimensional white noise or innovation process, that is,
$E(u_t) = 0$, $E(u_t u_t') = \Sigma_u$ and $E(u_t u_s') = 0$ for $s \neq t$. The covariance matrix
$\Sigma_u$ is assumed to be nonsingular if not otherwise stated.

At  this  stage,  it  may  be  worth  thinking  a  little  more  about  which  process 
is  described  by  (2.1.1).  In  order  to  investigate  the  implications  of  the  model 
let us consider the VAR(1) model

$$y_t = \nu + A_1 y_{t-1} + u_t. \tag{2.1.2}$$
If this generation mechanism starts at some time $t = 1$, say, we get

$$\begin{aligned}
y_1 &= \nu + A_1 y_0 + u_1, \\
y_2 &= \nu + A_1 y_1 + u_2 = \nu + A_1(\nu + A_1 y_0 + u_1) + u_2 \\
    &= (I_K + A_1)\nu + A_1^2 y_0 + A_1 u_1 + u_2, \\
    &\;\;\vdots \\
y_t &= (I_K + A_1 + \cdots + A_1^{t-1})\nu + A_1^t y_0 + \sum_{i=0}^{t-1} A_1^i u_{t-i} \\
    &\;\;\vdots
\end{aligned} \tag{2.1.3}$$

Hence, the vectors $y_1, \ldots, y_t$ are uniquely determined by $y_0, u_1, \ldots, u_t$. Also,
the joint distribution of $y_1, \ldots, y_t$ is determined by the joint distribution of
$y_0, u_1, \ldots, u_t$.

Although we will sometimes assume that a process is started in a specified
period, it is often convenient to assume that it has been started in the infinite
past. This assumption is in fact made in (2.1.1). What kind of process is
consistent with the mechanism (2.1.1) in that case? To investigate this question
we consider again the VAR(1) process (2.1.2). From (2.1.3) we have

$$\begin{aligned}
y_t &= \nu + A_1 y_{t-1} + u_t \\
    &\;\;\vdots \\
    &= (I_K + A_1 + \cdots + A_1^j)\nu + A_1^{j+1} y_{t-j-1} + \sum_{i=0}^{j} A_1^i u_{t-i}.
\end{aligned}$$

If all eigenvalues of $A_1$ have modulus less than 1, the sequence $A_1^i$, $i = 0, 1, \ldots$,
is absolutely summable (see Appendix A, Section A.9.1). Hence, the infinite
sum

$$\sum_{i=1}^{\infty} A_1^i u_{t-i}$$

exists in mean square (Appendix C, Proposition C.9). Moreover,

$$(I_K + A_1 + \cdots + A_1^j)\nu \;\underset{j \to \infty}{\longrightarrow}\; (I_K - A_1)^{-1}\nu$$

(Appendix A, Section A.9.1). Furthermore, $A_1^{j+1}$ converges to zero rapidly as
$j \to \infty$ and, thus, we ignore the term $A_1^{j+1} y_{t-j-1}$ in the limit. Hence, if all
eigenvalues of $A_1$ have modulus less than 1, by saying that $y_t$ is the VAR(1)
process (2.1.2) we mean that $y_t$ is the well-defined stochastic process

$$y_t = \mu + \sum_{i=0}^{\infty} A_1^i u_{t-i}, \quad t = 0, \pm 1, \pm 2, \ldots, \tag{2.1.4}$$

where

$$\mu := (I_K - A_1)^{-1}\nu.$$
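
The convergence of the intercept term to $(I_K - A_1)^{-1}\nu$ can be checked numerically. The following is a minimal sketch, assuming NumPy, that uses the coefficient matrix of the example process (2.1.14) introduced later in this section; the intercept vector `nu` is a hypothetical choice made only for illustration.

```python
import numpy as np

# Coefficient matrix of the example VAR(1) process (2.1.14).
A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
# Hypothetical intercept vector; the text does not specify nu for (2.1.14).
nu = np.array([1.0, 2.0, 3.0])

K = A1.shape[0]
mu = np.linalg.solve(np.eye(K) - A1, nu)  # mu = (I_K - A_1)^{-1} nu

# Partial sums (I_K + A_1 + ... + A_1^j) nu for j = 0, ..., 59.
partial = np.zeros(K)
power = np.eye(K)          # holds A_1^j
gaps = []                  # largest absolute deviation from mu after each j
for j in range(60):
    partial = partial + power @ nu
    power = power @ A1
    gaps.append(np.max(np.abs(partial - mu)))
```

Because all eigenvalues of this `A1` have modulus below 1, the entries of `gaps` shrink geometrically toward zero.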

The distributions and joint distributions of the $y_t$'s are uniquely determined
by the distributions of the $u_t$ process. From Appendix C.3, Proposition C.10,
the first and second moments of the $y_t$ process are seen to be

$$E(y_t) = \mu \quad \text{for all } t \tag{2.1.5}$$

and

$$\begin{aligned}
\Gamma_y(h) &:= E(y_t - \mu)(y_{t-h} - \mu)' \\
&= \lim_{n \to \infty} \sum_{i=0}^{n} \sum_{j=0}^{n} A_1^i E(u_{t-i} u_{t-h-j}') (A_1^j)' \\
&= \lim_{n \to \infty} \sum_{i=0}^{n} A_1^{h+i} \Sigma_u A_1^{i\prime} = \sum_{i=0}^{\infty} A_1^{h+i} \Sigma_u A_1^{i\prime},
\end{aligned} \tag{2.1.6}$$

because $E(u_t u_s') = 0$ for $s \neq t$ and $E(u_t u_t') = \Sigma_u$ for all $t$.

Because the condition for the eigenvalues of the matrix $A_1$ is of importance,
we call a VAR(1) process stable if all eigenvalues of $A_1$ have modulus less than
1. By Rule (7) of Appendix A.6, the condition is equivalent to

$$\det(I_K - A_1 z) \neq 0 \quad \text{for } |z| \leq 1. \tag{2.1.7}$$

It is perhaps worth pointing out that the process $y_t$ for $t = 0, \pm 1, \pm 2, \ldots$ may
also be defined if the stability condition (2.1.7) is not satisfied. We will not
do so here because we will always assume stability of processes defined for all
$t \in \mathbb{Z}$.

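The previous discussion can be extended easily to VAR(p) processes with
$p > 1$ because any VAR(p) process can be written in VAR(1) form. More
precisely, if $y_t$ is a VAR(p) as in (2.1.1), a corresponding $Kp$-dimensional
VAR(1)

$$Y_t = \boldsymbol{\nu} + \mathbf{A} Y_{t-1} + U_t \tag{2.1.8}$$

can be defined, where

$$Y_t := \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}_{(Kp \times 1)}, \quad
\boldsymbol{\nu} := \begin{bmatrix} \nu \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{(Kp \times 1)}, \quad
\mathbf{A} := \begin{bmatrix}
A_1 & A_2 & \cdots & A_{p-1} & A_p \\
I_K & 0 & \cdots & 0 & 0 \\
0 & I_K & & 0 & 0 \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & I_K & 0
\end{bmatrix}_{(Kp \times Kp)}, \quad
U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{(Kp \times 1)}.$$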
Following the foregoing discussion, $Y_t$ is stable if

$$\det(I_{Kp} - \mathbf{A} z) \neq 0 \quad \text{for } |z| \leq 1. \tag{2.1.9}$$

Its mean vector is

$$\boldsymbol{\mu} := E(Y_t) = (I_{Kp} - \mathbf{A})^{-1} \boldsymbol{\nu}$$

and the autocovariances are

$$\Gamma_Y(h) = \sum_{i=0}^{\infty} \mathbf{A}^{h+i} \Sigma_U (\mathbf{A}^i)', \tag{2.1.10}$$

where $\Sigma_U := E(U_t U_t')$. Using the $(K \times Kp)$ matrix

$$J := [I_K : 0 : \cdots : 0] \tag{2.1.11}$$

the process $y_t$ is obtained as $y_t = J Y_t$. Because $Y_t$ is a well-defined stochastic
process, the same is true for $y_t$. Its mean is $E(y_t) = J\boldsymbol{\mu}$ which is constant for
all $t$ and the autocovariances $\Gamma_y(h) = J \Gamma_Y(h) J'$ are also time invariant.

It is easy to see that

$$\det(I_{Kp} - \mathbf{A} z) = \det(I_K - A_1 z - \cdots - A_p z^p)$$

(see Problem 2.1). Given the definition of the characteristic polynomial of a
matrix, we call this polynomial the reverse characteristic polynomial of the
VAR(p) process. Hence, the process (2.1.1) is stable if its reverse characteristic
polynomial has no roots in and on the complex unit circle. Formally $y_t$ is stable
if

$$\det(I_K - A_1 z - \cdots - A_p z^p) \neq 0 \quad \text{for } |z| \leq 1. \tag{2.1.12}$$

This condition is called the stability condition.

In summary, we say that $y_t$ is a stable VAR(p) process if (2.1.12) holds and

$$y_t = J Y_t = J\boldsymbol{\mu} + J \sum_{i=0}^{\infty} \mathbf{A}^i U_{t-i}. \tag{2.1.13}$$

Because the $U_t := (u_t', 0, \ldots, 0)'$ involve the white noise process $u_t$, the process
$y_t$ is seen to be determined by its white noise or innovation process. Often
specific assumptions regarding $u_t$ are made which determine the process $y_t$ by
the foregoing convention. An important example is the assumption that $u_t$ is
Gaussian white noise, that is, $u_t \sim \mathcal{N}(0, \Sigma_u)$ for all $t$ and $u_t$ and $u_s$ are
independent for $s \neq t$. In that case, it can be shown that $y_t$ is a Gaussian process,
that is, subcollections $y_t, \ldots, y_{t+h}$ have multivariate normal distributions for
all $t$ and $h$.

The condition (2.1.12) provides an easy tool for checking the stability of
a VAR process. Consider, for instance, the three-dimensional VAR(1) process

$$y_t = \nu + \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix} y_{t-1} + u_t. \tag{2.1.14}$$

For this process the reverse characteristic polynomial is

$$\begin{aligned}
\det\left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
- \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix} z \right)
&= \det \begin{bmatrix} 1 - .5z & 0 & 0 \\ -.1z & 1 - .1z & -.3z \\ 0 & -.2z & 1 - .3z \end{bmatrix} \\
&= (1 - .5z)(1 - .4z - .03z^2).
\end{aligned}$$

The roots of this polynomial are easily seen to be

$$z_1 = 2, \quad z_2 = 2.1525, \quad z_3 = -15.4858.$$

They are obviously all greater than 1 in absolute value. Therefore the process
(2.1.14) is stable.

As another example consider the bivariate (two-dimensional) VAR(2) process

$$y_t = \nu + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} y_{t-1}
+ \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} y_{t-2} + u_t. \tag{2.1.15}$$

Its reverse characteristic polynomial is

$$\det\left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
- \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} z
- \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} z^2 \right)
= 1 - z + .21z^2 - .025z^3.$$

The roots of this polynomial are

$$z_1 = 1.3, \quad z_2 = 3.55 + 4.26i, \quad \text{and} \quad z_3 = 3.55 - 4.26i.$$

Here $i := \sqrt{-1}$ denotes the imaginary unit. Note that the modulus of $z_2$ and
$z_3$ is $|z_2| = |z_3| = \sqrt{3.55^2 + 4.26^2} = 5.545$. Thus, the process (2.1.15) satisfies
the stability condition (2.1.12) because all roots are outside the unit circle.
Although  the  roots  for  higher  dimensional  and  higher  order  processes  are  often 
difficult  to  compute  by  hand,  efficient  computer  programs  exist  that  do  the 
job. 
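
As an illustration of such a computation, the following sketch checks stability of the two example processes with NumPy. It relies on the equivalence between (2.1.9) and (2.1.12): the process is stable when all eigenvalues of the companion matrix in (2.1.8) have modulus less than 1. The helper names `companion` and `is_stable` are hypothetical and chosen here only for illustration.

```python
import numpy as np

def companion(coefs):
    """Stack the (K x K) matrices A_1, ..., A_p into the (Kp x Kp) matrix of (2.1.8)."""
    K = coefs[0].shape[0]
    p = len(coefs)
    A = np.zeros((K * p, K * p))
    A[:K, :] = np.hstack(coefs)          # first block row: A_1, ..., A_p
    A[K:, :-K] = np.eye(K * (p - 1))     # identity blocks below the first block row
    return A

def is_stable(coefs):
    """True if all eigenvalues of the companion matrix have modulus < 1."""
    eigvals = np.linalg.eigvals(companion(coefs))
    return bool(np.max(np.abs(eigvals)) < 1.0)

# Example process (2.1.14).
A_2114 = np.array([[0.5, 0.0, 0.0],
                   [0.1, 0.1, 0.3],
                   [0.0, 0.2, 0.3]])
# Example process (2.1.15).
B1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])
B2 = np.array([[0.0, 0.0],
               [0.25, 0.0]])

stable_var1 = is_stable([A_2114])
stable_var2 = is_stable([B1, B2])

# Roots of the reverse characteristic polynomial are the reciprocals of the
# nonzero eigenvalues of the companion matrix.
roots_var1 = np.sort((1.0 / np.linalg.eigvals(A_2114)).real)

# For (2.1.15) the polynomial is 1 - z + .21 z^2 - .025 z^3; np.roots expects
# the coefficients ordered from the highest power down.
roots_var2 = np.roots([-0.025, 0.21, -1.0, 1.0])
```

Working with eigenvalues of the companion matrix avoids forming the determinant polynomial explicitly, which is convenient for larger $K$ and $p$.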

To  understand  the  implications  of  the  stability  assumption,  it  may  be 
helpful  to  visualize  time  series  generated  by  stable  processes  and  contrast 
them  with  realizations  from  unstable  VAR  processes.  In  Figure  2.1  three  pairs 
of time series generated by three different stable bivariate (two-dimensional)
VAR  processes  are  depicted.  Although  they  differ  considerably,  a  common 
feature  is  that  they  fluctuate  around  constant  means  and  their  variability 
(variance)  does  not  change  as  they  wander  along.  In  contrast,  the  pairs  of 
series  plotted  in  Figures  2.2  and  2.3  are  generated  by  unstable,  bivariate  VAR 
processes.  The  time  series  in  Figure  2.2  have  a  trend  and  those  in  Figure 
2.3  exhibit  quite  pronounced  seasonal  fluctuations.  Both  shapes  are  typical 
of  certain  instabilities  although  they  are  quite  common  in  practice.  Hence, 
the  stability  assumption  excludes  many  series  of  practical  interest.  We  shall 
therefore  discuss  unstable  processes  in  more  detail  in  Part  II.  For  that  analysis 
understanding  the  stable  case  first  is  helpful. 


2.1.2  The  Moving  Average  Representation  of  a  VAR  Process 

In the previous subsection we have considered the VAR(1) representation

$$Y_t = \boldsymbol{\nu} + \mathbf{A} Y_{t-1} + U_t$$

of the VAR(p) process (2.1.1). Under the stability assumption, the process $Y_t$
has a representation

$$Y_t = \boldsymbol{\mu} + \sum_{i=0}^{\infty} \mathbf{A}^i U_{t-i}. \tag{2.1.16}$$

This form of the process is called the moving average (MA) representation,
where $Y_t$ is expressed in terms of past and present error or innovation vectors
$U_t$ and the mean term $\boldsymbol{\mu}$. This representation can be used to determine the
autocovariances of $Y_t$ and the mean and autocovariances of $y_t$ can be obtained
as outlined in Section 2.1.1. Moreover, an MA representation of $y_t$ can be found
by premultiplying (2.1.16) by the $(K \times Kp)$ matrix $J := [I_K : 0 : \cdots : 0]$
(defined in (2.1.11)),

$$\begin{aligned}
y_t = J Y_t &= J\boldsymbol{\mu} + \sum_{i=0}^{\infty} J \mathbf{A}^i J' J U_{t-i} \\
&= \mu + \sum_{i=0}^{\infty} \Phi_i u_{t-i}.
\end{aligned} \tag{2.1.17}$$

Here $\mu := J\boldsymbol{\mu}$, $\Phi_i := J \mathbf{A}^i J'$ and, due to the special structure of the white
noise process $U_t$, we have $U_t = J' J U_t$ and $J U_t = u_t$. Because the $\mathbf{A}^i$ are
absolutely summable, the same is true for the $\Phi_i$.

Later we will also consider other MA representations of a stable VAR(p)
process. The unique feature of the present representation is that the zero order
coefficient matrix $\Phi_0 = I_K$ and the white noise process involved consists of the
error terms $u_t$ of the VAR representation (2.1.1). In Section 2.2.2, the $u_t$ will
be seen to be the errors of optimal forecasts made in period $t - 1$. Therefore,
to distinguish the present representation from other MA representations, we
will sometimes refer to it as the canonical or fundamental or prediction error
representation.

Fig. 2.3. Unstable seasonal time series.

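Using Proposition C.10 of Appendix C.3, the representation (2.1.17) provides
a possibility for determining the mean and autocovariances of $y_t$:

$$E(y_t) = \mu$$

and

$$\begin{aligned}
\Gamma_y(h) &= E[(y_t - \mu)(y_{t-h} - \mu)'] \\
&= E\left[ \left( \sum_{i=0}^{h-1} \Phi_i u_{t-i} + \sum_{i=0}^{\infty} \Phi_{h+i} u_{t-h-i} \right)
\left( \sum_{i=0}^{\infty} \Phi_i u_{t-h-i} \right)' \right] \\
&= \sum_{i=0}^{\infty} \Phi_{h+i} \Sigma_u \Phi_i'.
\end{aligned} \tag{2.1.18}$$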

There is no need to compute the MA coefficient matrices $\Phi_i$ via the VAR(1)
representation corresponding to $y_t$ as in the foregoing derivation. A more
direct way for determining these matrices results from writing the VAR(p)
process in lag operator notation. The lag operator $L$ is defined such that
$L y_t = y_{t-1}$, that is, it lags (shifts back) the index by one period. Because of
this property it is sometimes called backshift operator. Using this operator,
(2.1.1) can be written as

$$y_t = \nu + (A_1 L + \cdots + A_p L^p) y_t + u_t$$

or

$$A(L) y_t = \nu + u_t, \tag{2.1.19}$$

where

$$A(L) := I_K - A_1 L - \cdots - A_p L^p.$$

Let

$$\Phi(L) := \sum_{i=0}^{\infty} \Phi_i L^i$$

be an operator such that

$$\Phi(L) A(L) = I_K. \tag{2.1.20}$$

Premultiplying (2.1.19) by $\Phi(L)$ gives

$$\begin{aligned}
y_t &= \Phi(L)\nu + \Phi(L) u_t \\
&= \left( \sum_{i=0}^{\infty} \Phi_i \right) \nu + \sum_{i=0}^{\infty} \Phi_i u_{t-i}.
\end{aligned} \tag{2.1.21}$$


The operator $\Phi(L)$ is the inverse of $A(L)$ and it is therefore sometimes denoted
by $A(L)^{-1}$. Generally, we call the operator $A(L)$ invertible if $|A(z)| \neq 0$ for
$|z| \leq 1$. If this condition is satisfied, the coefficient matrices of $\Phi(L) = A(L)^{-1}$
are absolutely summable and, hence, the process $\Phi(L) u_t = A(L)^{-1} u_t$ is
well-defined (see Appendix C.3). The coefficient matrices $\Phi_i$ can be obtained from
(2.1.20) using the relations

$$\begin{aligned}
I_K &= (\Phi_0 + \Phi_1 L + \Phi_2 L^2 + \cdots)(I_K - A_1 L - \cdots - A_p L^p) \\
&= \Phi_0 + (\Phi_1 - \Phi_0 A_1) L + (\Phi_2 - \Phi_1 A_1 - \Phi_0 A_2) L^2 + \cdots \\
&\quad + \left( \Phi_i - \sum_{j=1}^{i} \Phi_{i-j} A_j \right) L^i + \cdots
\end{aligned}$$

or

$$\begin{aligned}
I_K &= \Phi_0 \\
0 &= \Phi_1 - \Phi_0 A_1 \\
0 &= \Phi_2 - \Phi_1 A_1 - \Phi_0 A_2 \\
&\;\;\vdots \\
0 &= \Phi_i - \sum_{j=1}^{i} \Phi_{i-j} A_j \\
&\;\;\vdots
\end{aligned}$$

where $A_j = 0$ for $j > p$. Hence, the $\Phi_i$ can be computed recursively using

$$\begin{aligned}
\Phi_0 &= I_K, \\
\Phi_i &= \sum_{j=1}^{i} \Phi_{i-j} A_j, \quad i = 1, 2, \ldots.
\end{aligned} \tag{2.1.22}$$

The mean $\mu$ of $y_t$ can be obtained as follows:

$$\mu = \Phi(1)\nu = A(1)^{-1}\nu = (I_K - A_1 - \cdots - A_p)^{-1}\nu. \tag{2.1.23}$$


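For a VAR(1) process, the recursions (2.1.22) imply that $\Phi_0 = I_K$, $\Phi_1 =
A_1, \ldots, \Phi_i = A_1^i, \ldots$. This result is in line with (2.1.4). For the example
VAR(1) process (2.1.14), we get $\Phi_0 = I_3$,

$$\Phi_1 = \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix}, \quad
\Phi_2 = \begin{bmatrix} .25 & 0 & 0 \\ .06 & .07 & .12 \\ .02 & .08 & .15 \end{bmatrix}, \quad
\Phi_3 = \begin{bmatrix} .125 & 0 & 0 \\ .037 & .031 & .057 \\ .018 & .038 & .069 \end{bmatrix},$$

etc. For a VAR(2), the recursions (2.1.22) result in

$$\begin{aligned}
\Phi_1 &= A_1 \\
\Phi_2 &= \Phi_1 A_1 + A_2 = A_1^2 + A_2 \\
\Phi_3 &= \Phi_2 A_1 + \Phi_1 A_2 = A_1^3 + A_2 A_1 + A_1 A_2 \\
&\;\;\vdots \\
\Phi_i &= \Phi_{i-1} A_1 + \Phi_{i-2} A_2 \\
&\;\;\vdots
\end{aligned} \tag{2.1.24}$$

Thus, for the example VAR(2) process (2.1.15), we get the MA coefficient
matrices $\Phi_0 = I_2$,

$$\Phi_1 = \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix}, \quad
\Phi_2 = \begin{bmatrix} .29 & .1 \\ .65 & .29 \end{bmatrix}, \quad
\Phi_3 = \begin{bmatrix} .21 & .079 \\ .566 & .21 \end{bmatrix}, \tag{2.1.25}$$

etc. For both example processes, the $\Phi_i$ matrices approach zero as $i \to \infty$.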
This  property  is  a  consequence  of  the  stability  of  the  two  processes. 
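
The recursions (2.1.22) translate directly into a short loop. The sketch below, assuming NumPy, reproduces the matrices reported for the two example processes; the function name `ma_coefficients` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def ma_coefficients(coefs, n):
    """Return [Phi_0, ..., Phi_n] from recursion (2.1.22), with A_j = 0 for j > p."""
    K = coefs[0].shape[0]
    p = len(coefs)
    phis = [np.eye(K)]                      # Phi_0 = I_K
    for i in range(1, n + 1):
        phi_i = np.zeros((K, K))
        for j in range(1, min(i, p) + 1):   # terms with j > p vanish
            phi_i += phis[i - j] @ coefs[j - 1]
        phis.append(phi_i)
    return phis

# Example VAR(1) process (2.1.14).
A_2114 = np.array([[0.5, 0.0, 0.0],
                   [0.1, 0.1, 0.3],
                   [0.0, 0.2, 0.3]])
# Example VAR(2) process (2.1.15).
B1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])
B2 = np.array([[0.0, 0.0],
               [0.25, 0.0]])

phis_var1 = ma_coefficients([A_2114], 3)
phis_var2 = ma_coefficients([B1, B2], 60)
```

Inspecting `phis_var2[60]` shows entries that are numerically close to zero, in line with the decay of the $\Phi_i$ for a stable process.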

It may be worth noting that the MA representation of a stable VAR(p)
process is not necessarily of infinite order. That is, the $\Phi_i$ may all be zero for
$i$ greater than some finite integer $q$. For instance, for the bivariate VAR(1)

$$y_t = \nu + \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} y_{t-1} + u_t,$$

the MA representation is easily seen to be

$$y_t = \mu + u_t + \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} u_{t-1},$$

because

$$\begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix}^i = 0$$

for $i > 1$.
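
A quick numerical check of this finite-order case is sketched below, assuming NumPy. The value of `alpha` is a hypothetical choice, because the text leaves $\alpha$ unspecified.

```python
import numpy as np

alpha = 0.7  # hypothetical value; any real alpha gives the same pattern
A1 = np.array([[0.0, alpha],
               [0.0, 0.0]])

# For a VAR(1), Phi_i = A_1^i, so the MA coefficients are matrix powers.
phis = [np.linalg.matrix_power(A1, i) for i in range(5)]

# Largest lag with a nonzero MA coefficient matrix.
ma_order = max(i for i, phi in enumerate(phis) if np.any(phi != 0))
```

Here `A1` is nilpotent, so both of its eigenvalues are zero and the process is stable for every $\alpha$.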


2.1.3  Stationary  Processes 

A  stochastic  process  is  stationary  if  its  first  and  second  moments  are  time 
invariant.  In  other  words,  a  stochastic  process  yt  is  stationary  if 

$$E(y_t) = \mu \quad \text{for all } t \tag{2.1.26a}$$

and

$$E[(y_t - \mu)(y_{t-h} - \mu)'] = \Gamma_y(h) = \Gamma_y(-h)' \quad \text{for all } t \text{ and } h = 0, 1, 2, \ldots. \tag{2.1.26b}$$

Condition (2.1.26a) means that all $y_t$ have the same finite mean vector $\mu$ and
(2.1.26b) requires that the autocovariances of the process do not depend on $t$
but just on the time period $h$ the two vectors $y_t$ and $y_{t-h}$ are apart. Note that,
if not otherwise stated, all quantities are assumed to be finite. For instance, $\mu$ is
a vector of finite mean terms and $\Gamma_y(h)$ is a matrix of finite covariances. Other
definitions  of  stationarity  are  often  used  in  the  literature.  For  example,  the 
joint  distribution  of  n  consecutive  vectors  may  be  assumed  to  be  time  invariant 
for  all  n.  We  shall,  however,  use  the  foregoing  definition  in  the  following.  We 
call  a  process  strictly  stationary  if  the  joint  distributions  of  n  consecutive 
variables  are  time  invariant  and  there  is  a  reason  to  distinguish  between  our 
notion  of  stationarity  and  the  stricter  form.  By  our  definition,  the  white  noise 
process  ut  used  in  (2.1.1)  is  an  obvious  example  of  a  stationary  process.  Also, 
from  (2.1.18)  we  know  that  a  stable  VAR(p)  process  is  stationary.  We  state 
this  fact  as  a  proposition. 

Proposition 2.1 (Stationarity Condition)

A stable VAR(p) process $y_t$, $t = 0, \pm 1, \pm 2, \ldots$, is stationary. ■

Because  stability  implies  stationarity,  the  stability  condition  (2.1.12)  is 
often  referred  to  as  stationarity  condition  in  the  time  series  literature.  The 
converse  of  Proposition  2.1  is  not  true.  In  other  words,  an  unstable  process  is 
not  necessarily  nonstationary.  Because  unstable  stationary  processes  are  not 
of  interest  in  the  following,  we  will  not  discuss  this  possibility  here. 

At  this  stage,  it  may  be  worth  thinking  about  the  generality  of  the  VAR(p) 
processes  considered  in  this  and  many  other  chapters.  In  this  context,  an 
important  result  due  to  Wold  (1938)  is  of  interest.  He  has  shown  that  every 
stationary process $x_t$ can be written as the sum of two uncorrelated processes
$z_t$ and $y_t$,

$$x_t = z_t + y_t,$$

where $z_t$ is a deterministic process that can be forecast perfectly from its own
past and $y_t$ is a process with MA representation

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i}, \tag{2.1.27}$$

where $\Phi_0 = I_K$, the $u_t$ constitute a white noise process and the infinite sum
is defined as a limit in mean square although the $\Phi_i$ are not necessarily
absolutely summable (Hannan (1970, Chapter III)). The term "deterministic" will
be explained more formally in Section 2.2. This result is often called Wold's
Decomposition Theorem. If we assume that in the system of interest the only
deterministic component is the mean term, the theorem states that the system
has an MA representation. Suppose the $\Phi_i$ are absolutely summable and
there exists an operator $A(L)$ with absolutely summable coefficient matrices
satisfying $A(L)\Phi(L) = I_K$. Then $\Phi(L)$ is invertible ($A(L) = \Phi(L)^{-1}$) and $y_t$
has a VAR representation of possibly infinite order,

$$y_t = \sum_{i=1}^{\infty} A_i y_{t-i} + u_t, \tag{2.1.28}$$

where

$$A(z) := I_K - \sum_{i=1}^{\infty} A_i z^i = \left( \sum_{i=0}^{\infty} \Phi_i z^i \right)^{-1} \quad \text{for } |z| \leq 1.$$

The $A_i$ can be obtained from the $\Phi_i$ by recursions similar to (2.1.22).

The absolute summability of the $A_i$ implies that the VAR coefficient matrices
converge to zero rapidly. In other words, under quite general conditions,
every stationary, purely nondeterministic process (a process without a
deterministic component) can be approximated well by a finite order VAR process.
This is a very powerful result which demonstrates the generality of the processes
under study. Note that economic variables can rarely be predicted without
error. Thus the assumption of having a nondeterministic system except
perhaps for a mean term is not a very restrictive one. The crucial and
restrictive condition is the stationarity of the system, however. We will consider
nonstationary processes later. For that discussion it is useful to understand
the stationary case first.

An important implication of Wold's Decomposition Theorem is worth noting
at this point. The theorem implies that any subprocess of a purely
nondeterministic, stationary process $y_t$ consisting of any subset of the components
of $y_t$ also has an MA representation. Suppose, for instance, that interest centers
on the first $M$ components of the $K$-dimensional process $y_t$, that is, we
are interested in $x_t = F y_t$, where $F = [I_M : 0]$ is an $(M \times K)$ matrix. Then
$E(x_t) = F E(y_t) = F\mu$ and $\Gamma_x(h) = F \Gamma_y(h) F'$ and, thus, $x_t$ is stationary.
Application of Wold's theorem then implies that $x_t$ has an MA representation.

2.1.4 Computation of Autocovariances and Autocorrelations of
Stable VAR Processes

Although  the  autocovariances  of  a  stationary,  stable  VAR(p)  process  can  be 
given  in  terms  of  its  MA  coefficient  matrices  as  in  (2.1.18),  that  formula  is 
unattractive  in  practice,  because  it  involves  an  infinite  sum.  For  practical 
purposes  it  is  easier  to  compute  the  autocovariances  directly  from  the  VAR 
coefficient  matrices.  In  this  section,  we  will  develop  the  relevant  formulas. 

Autocovariances  of  a  VAR(l)  Process 

In  order  to  illustrate  the  computation  of  the  autocovariances  when  the  process 
coefficients  are  given,  suppose  that  yt  is  a  stationary,  stable  VAR(l)  process 


$$y_t = \nu + A_1 y_{t-1} + u_t$$

with white noise covariance matrix $E(u_t u_t') = \Sigma_u$. Alternatively, the process
may be written in mean-adjusted form as

$$y_t - \mu = A_1 (y_{t-1} - \mu) + u_t, \tag{2.1.29}$$

where $\mu = E(y_t)$, as before. Postmultiplying by $(y_{t-h} - \mu)'$ and taking
expectations gives

$$E[(y_t - \mu)(y_{t-h} - \mu)'] = A_1 E[(y_{t-1} - \mu)(y_{t-h} - \mu)'] + E[u_t (y_{t-h} - \mu)'].$$

Thus, for $h = 0$,

$$\Gamma_y(0) = A_1 \Gamma_y(-1) + \Sigma_u = A_1 \Gamma_y(1)' + \Sigma_u \tag{2.1.30}$$

and for $h > 0$

$$\Gamma_y(h) = A_1 \Gamma_y(h - 1). \tag{2.1.31}$$

These equations are usually referred to as Yule-Walker equations. If $A_1$ and
the covariance matrix $\Gamma_y(0) = \Sigma_y$ of $y_t$ are known, the $\Gamma_y(h)$ can be computed
recursively using (2.1.31).

If $A_1$ and $\Sigma_u$ are given, $\Gamma_y(0)$ can be determined as follows. For $h = 1$, we
get from (2.1.31), $\Gamma_y(1) = A_1 \Gamma_y(0)$. Substituting $A_1 \Gamma_y(0)$ for $\Gamma_y(1)$ in (2.1.30)
gives

$$\Gamma_y(0) = A_1 \Gamma_y(0) A_1' + \Sigma_u$$

or

$$\begin{aligned}
\operatorname{vec} \Gamma_y(0) &= \operatorname{vec}(A_1 \Gamma_y(0) A_1') + \operatorname{vec} \Sigma_u \\
&= (A_1 \otimes A_1) \operatorname{vec} \Gamma_y(0) + \operatorname{vec} \Sigma_u.
\end{aligned}$$

(For the definition of the Kronecker product, the vec operator, and the rules
used here, see Appendix A.) Hence,

$$\operatorname{vec} \Gamma_y(0) = (I_{K^2} - A_1 \otimes A_1)^{-1} \operatorname{vec} \Sigma_u. \tag{2.1.32}$$

Note that the invertibility of $I_{K^2} - A_1 \otimes A_1$ follows from the stability of $y_t$
because the eigenvalues of $A_1 \otimes A_1$ are the products of the eigenvalues of $A_1$
(see Appendix A). Hence, the eigenvalues of $A_1 \otimes A_1$ have modulus less than
1. Consequently, $\det(I_{K^2} - A_1 \otimes A_1) \neq 0$ (see Appendix A.9.1).

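Using, for instance,

$$\Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix}, \tag{2.1.33}$$

we get for the example process (2.1.14),

$$\operatorname{vec} \Gamma_y(0) = (I_9 - A_1 \otimes A_1)^{-1} \operatorname{vec} \Sigma_u
= (3.000,\ .161,\ .019,\ .161,\ 1.172,\ .674,\ .019,\ .674,\ .954)'.$$

It follows that

$$\begin{aligned}
\Gamma_y(0) &= \begin{bmatrix} 3.000 & .161 & .019 \\ .161 & 1.172 & .674 \\ .019 & .674 & .954 \end{bmatrix}, \\
\Gamma_y(1) &= A_1 \Gamma_y(0) = \begin{bmatrix} 1.500 & .080 & .009 \\ .322 & .335 & .355 \\ .038 & .437 & .421 \end{bmatrix}, \\
\Gamma_y(2) &= A_1 \Gamma_y(1) = \begin{bmatrix} .750 & .040 & .005 \\ .194 & .173 & .163 \\ .076 & .198 & .197 \end{bmatrix}.
\end{aligned} \tag{2.1.34}$$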


Note  that  the  results  are  rounded  after  the  computation.  A  higher  precision 
has  been  used  in  intermediate  steps. 
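
The computation in (2.1.32) to (2.1.34) can be reproduced with a few lines of NumPy. The following is a sketch under the assumption that the vec operator stacks columns, which corresponds to Fortran-order reshaping in NumPy.

```python
import numpy as np

# Example process (2.1.14) with the white noise covariance (2.1.33).
A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
Sigma_u = np.array([[2.25, 0.0, 0.0],
                    [0.0, 1.0, 0.5],
                    [0.0, 0.5, 0.74]])
K = A1.shape[0]

# vec stacks the columns of a matrix, hence order="F".
vec_sigma = Sigma_u.reshape(-1, order="F")

# Solve (I_{K^2} - A_1 (x) A_1) vec Gamma_y(0) = vec Sigma_u, cf. (2.1.32).
vec_gamma0 = np.linalg.solve(np.eye(K * K) - np.kron(A1, A1), vec_sigma)
Gamma0 = vec_gamma0.reshape(K, K, order="F")

# Higher-order autocovariances from the Yule-Walker recursion (2.1.31).
Gamma1 = A1 @ Gamma0
Gamma2 = A1 @ Gamma1
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the $(K^2 \times K^2)$ matrix explicitly.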


Autocovariances  of  a  Stable  VAR(p)  Process 

For a higher order VAR(p) process,

$$y_t - \mu = A_1 (y_{t-1} - \mu) + \cdots + A_p (y_{t-p} - \mu) + u_t, \tag{2.1.35}$$

the Yule-Walker equations are also obtained by postmultiplying with $(y_{t-h} - \mu)'$
and taking expectations. For $h = 0$, using $\Gamma_y(i) = \Gamma_y(-i)'$,

$$\begin{aligned}
\Gamma_y(0) &= A_1 \Gamma_y(-1) + \cdots + A_p \Gamma_y(-p) + \Sigma_u \\
&= A_1 \Gamma_y(1)' + \cdots + A_p \Gamma_y(p)' + \Sigma_u,
\end{aligned} \tag{2.1.36}$$

and for $h > 0$,

$$\Gamma_y(h) = A_1 \Gamma_y(h - 1) + \cdots + A_p \Gamma_y(h - p). \tag{2.1.37}$$

These equations may be used to compute the $\Gamma_y(h)$ recursively for $h \geq p$, if
$A_1, \ldots, A_p$ and $\Gamma_y(p - 1), \ldots, \Gamma_y(0)$ are known.

The initial autocovariance matrices for $|h| < p$ can be determined using
the VAR(1) process that corresponds to (2.1.35),

$$Y_t - \boldsymbol{\mu} = \mathbf{A} (Y_{t-1} - \boldsymbol{\mu}) + U_t, \tag{2.1.38}$$
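where $Y_t$, $\mathbf{A}$, and $U_t$ are as in (2.1.8) and $\boldsymbol{\mu} := (\mu', \ldots, \mu')' = E(Y_t)$.
Proceeding as in the VAR(1) case gives

$$\Gamma_Y(0) = \mathbf{A} \Gamma_Y(0) \mathbf{A}' + \Sigma_U,$$

where $\Sigma_U = E(U_t U_t')$ and

$$\begin{aligned}
\Gamma_Y(0) &= E\left( \begin{bmatrix} y_t - \mu \\ \vdots \\ y_{t-p+1} - \mu \end{bmatrix}
[(y_t - \mu)', \ldots, (y_{t-p+1} - \mu)'] \right) \\
&= \begin{bmatrix}
\Gamma_y(0) & \Gamma_y(1) & \cdots & \Gamma_y(p-1) \\
\Gamma_y(-1) & \Gamma_y(0) & \cdots & \Gamma_y(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
\Gamma_y(-p+1) & \Gamma_y(-p+2) & \cdots & \Gamma_y(0)
\end{bmatrix}.
\end{aligned}$$

Thus, the $\Gamma_y(h)$, $h = -p + 1, \ldots, p - 1$, are obtained from

$$\operatorname{vec} \Gamma_Y(0) = (I_{(Kp)^2} - \mathbf{A} \otimes \mathbf{A})^{-1} \operatorname{vec} \Sigma_U. \tag{2.1.39}$$

For instance, for the example VAR(2) process (2.1.15) we get

$$\mathbf{A} = \begin{bmatrix} .5 & .1 & 0 & 0 \\ .4 & .5 & .25 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

and, assuming

$$\Sigma_u = \begin{bmatrix} .09 & 0 \\ 0 & .04 \end{bmatrix}, \tag{2.1.40}$$

we have

$$\Sigma_U = \begin{bmatrix} \Sigma_u & 0 \\ 0 & 0 \end{bmatrix}
= \begin{bmatrix} .09 & 0 & 0 & 0 \\ 0 & .04 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Hence, using (2.1.39) and

$$\Gamma_Y(0) = \begin{bmatrix} \Gamma_y(0) & \Gamma_y(1) \\ \Gamma_y(1)' & \Gamma_y(0) \end{bmatrix}$$

gives

$$\Gamma_y(0) = \begin{bmatrix} .131 & .066 \\ .066 & .181 \end{bmatrix}, \quad
\Gamma_y(1) = \begin{bmatrix} .072 & .051 \\ .104 & .143 \end{bmatrix}, \tag{2.1.41}$$

$$\begin{aligned}
\Gamma_y(2) &= A_1 \Gamma_y(1) + A_2 \Gamma_y(0) = \begin{bmatrix} .046 & .040 \\ .113 & .108 \end{bmatrix}, \\
\Gamma_y(3) &= A_1 \Gamma_y(2) + A_2 \Gamma_y(1) = \begin{bmatrix} .035 & .031 \\ .093 & .083 \end{bmatrix},
\end{aligned} \tag{2.1.42}$$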


and so on. A method for computing $\Gamma_y(0)$ without explicitly inverting
$(I - \mathbf{A} \otimes \mathbf{A})$ is given by Barone (1987).
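
For the VAR(2) example, the same steps can be sketched in NumPy via the companion form. This is the direct approach based on (2.1.39), not the method of Barone (1987); it assumes that vec stacks columns, which corresponds to Fortran-order reshaping.

```python
import numpy as np

# Example VAR(2) process (2.1.15) with Sigma_u from (2.1.40).
A1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])
A2 = np.array([[0.0, 0.0],
               [0.25, 0.0]])
Sigma_u = np.array([[0.09, 0.0],
                    [0.0, 0.04]])
K, p = 2, 2
n = K * p

# Companion matrix A and covariance matrix Sigma_U of the VAR(1) form (2.1.38).
A = np.zeros((n, n))
A[:K, :K] = A1
A[:K, K:] = A2
A[K:, :K] = np.eye(K)
Sigma_U = np.zeros((n, n))
Sigma_U[:K, :K] = Sigma_u

# Solve (I - A (x) A) vec Gamma_Y(0) = vec Sigma_U, cf. (2.1.39).
vec_gamma = np.linalg.solve(np.eye(n * n) - np.kron(A, A),
                            Sigma_U.reshape(-1, order="F"))
Gamma_Y0 = vec_gamma.reshape(n, n, order="F")

# Read off the blocks Gamma_y(0) and Gamma_y(1), then apply (2.1.37).
Gamma0 = Gamma_Y0[:K, :K]
Gamma1 = Gamma_Y0[:K, K:]
Gamma2 = A1 @ Gamma1 + A2 @ Gamma0
Gamma3 = A1 @ Gamma2 + A2 @ Gamma1
```

The lower left block of `Gamma_Y0` equals the transpose of `Gamma1`, as the block structure of $\Gamma_Y(0)$ requires.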

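The autocovariance function of a stationary VAR(p) process is positive
semidefinite, that is,

$$\sum_{j=0}^{n} \sum_{i=0}^{n} a_j' \Gamma_y(i - j) a_i
= (a_0', \ldots, a_n')
\begin{bmatrix}
\Gamma_y(0) & \Gamma_y(1) & \cdots & \Gamma_y(n) \\
\Gamma_y(-1) & \Gamma_y(0) & \cdots & \Gamma_y(n-1) \\
\vdots & \vdots & \ddots & \vdots \\
\Gamma_y(-n) & \Gamma_y(-n+1) & \cdots & \Gamma_y(0)
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}
\geq 0 \tag{2.1.43}$$

for any $n \geq 0$. Here the $a_i$ are arbitrary $(K \times 1)$ vectors. This result follows
because (2.1.43) is just the variance of

$$(a_0', \ldots, a_n') \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-n} \end{bmatrix}$$

which is always nonnegative.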
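
Positive semidefiniteness can be illustrated numerically for the example process (2.1.14). The sketch below, assuming NumPy, builds the block matrix in (2.1.43) for one value of $n$ and inspects its smallest eigenvalue; the helper name `gamma` and the choice `n = 3` are made only for this illustration.

```python
import numpy as np

A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
Sigma_u = np.array([[2.25, 0.0, 0.0],
                    [0.0, 1.0, 0.5],
                    [0.0, 0.5, 0.74]])
K = A1.shape[0]

# Gamma_y(0) from (2.1.32); vec stacks columns, hence order="F".
Gamma0 = np.linalg.solve(np.eye(K * K) - np.kron(A1, A1),
                         Sigma_u.reshape(-1, order="F")).reshape(K, K, order="F")

def gamma(h):
    """Gamma_y(h) of the VAR(1): A_1^h Gamma_y(0) for h >= 0, Gamma_y(-h)' otherwise."""
    if h >= 0:
        return np.linalg.matrix_power(A1, h) @ Gamma0
    return gamma(-h).T

n = 3
# Block (j, i) of the matrix in (2.1.43) is Gamma_y(i - j).
T = np.block([[gamma(i - j) for i in range(n + 1)] for j in range(n + 1)])
min_eig = np.linalg.eigvalsh(T).min()
```

A nonnegative `min_eig` (up to rounding error) is what (2.1.43) predicts, since a symmetric matrix is positive semidefinite exactly when all of its eigenvalues are nonnegative.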


Autocorrelations  of  a  Stable  VAR(p)  Process 

Because the autocovariances depend on the unit of measurement used for the
variables of the system, they are sometimes difficult to interpret. Therefore,
the autocorrelations

$$R_y(h) = D^{-1} \Gamma_y(h) D^{-1} \tag{2.1.44}$$

are usually more convenient to work with as they are scale invariant measures
of the linear dependencies among the variables of the system. Here $D$ is a
diagonal matrix with the standard deviations of the components of $y_t$ on the
main diagonal. That is, the diagonal elements of $D$ are the square roots of the
diagonal elements of $\Gamma_y(0)$. Denoting the covariance between $y_{i,t}$ and $y_{j,t-h}$
by $\gamma_{ij}(h)$ (i.e., $\gamma_{ij}(h)$ is the $ij$-th element of $\Gamma_y(h)$) the diagonal elements
$\gamma_{11}(0), \ldots, \gamma_{KK}(0)$ of $\Gamma_y(0)$ are the variances of $y_{1t}, \ldots, y_{Kt}$. Thus,

$$D^{-1} = \begin{bmatrix} 1/\sqrt{\gamma_{11}(0)} & & 0 \\ & \ddots & \\ 0 & & 1/\sqrt{\gamma_{KK}(0)} \end{bmatrix}$$

and the correlation between $y_{i,t}$ and $y_{j,t-h}$ is

$$\rho_{ij}(h) = \frac{\gamma_{ij}(h)}{\sqrt{\gamma_{ii}(0)} \sqrt{\gamma_{jj}(0)}} \tag{2.1.45}$$

which is just the $ij$-th element of $R_y(h)$.

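For the VAR(1) example process (2.1.14) we get from (2.1.34),

$$D = \begin{bmatrix} \sqrt{3.000} & 0 & 0 \\ 0 & \sqrt{1.172} & 0 \\ 0 & 0 & \sqrt{.954} \end{bmatrix}
= \begin{bmatrix} 1.732 & 0 & 0 \\ 0 & 1.083 & 0 \\ 0 & 0 & .977 \end{bmatrix}$$

and

$$\begin{aligned}
R_y(0) &= D^{-1} \Gamma_y(0) D^{-1} = \begin{bmatrix} 1 & .086 & .011 \\ .086 & 1 & .637 \\ .011 & .637 & 1 \end{bmatrix}, \\
R_y(1) &= D^{-1} \Gamma_y(1) D^{-1} = \begin{bmatrix} .500 & .043 & .005 \\ .172 & .286 & .336 \\ .022 & .413 & .441 \end{bmatrix}, \\
R_y(2) &= D^{-1} \Gamma_y(2) D^{-1} = \begin{bmatrix} .250 & .021 & .003 \\ .103 & .148 & .154 \\ .045 & .187 & .206 \end{bmatrix}.
\end{aligned} \tag{2.1.46}$$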


A  plot  of  some  autocorrelations  is  shown  in  Figure  2.4.  Assuming  that  the 
three  variables  of  the  system  represent  rates  of  change  of  investment,  income, 
and consumption, respectively, it can, for instance, be seen that the contemporaneous
and intertemporal correlations between consumption and investment
are  quite  small,  while  the  patterns  of  the  autocorrelations  of  the  individual 
series  are  similar. 
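
The scaling in (2.1.44) is straightforward to carry out numerically. The sketch below, assuming NumPy, starts from the rounded autocovariance matrices in (2.1.34), so its results agree with (2.1.46) only up to rounding.

```python
import numpy as np

# Rounded autocovariance matrices of the example process, taken from (2.1.34).
Gamma = {
    0: np.array([[3.000, 0.161, 0.019],
                 [0.161, 1.172, 0.674],
                 [0.019, 0.674, 0.954]]),
    1: np.array([[1.500, 0.080, 0.009],
                 [0.322, 0.335, 0.355],
                 [0.038, 0.437, 0.421]]),
    2: np.array([[0.750, 0.040, 0.005],
                 [0.194, 0.173, 0.163],
                 [0.076, 0.198, 0.197]]),
}

# D holds the standard deviations, i.e. square roots of the diagonal of Gamma_y(0).
d = np.sqrt(np.diag(Gamma[0]))
D_inv = np.diag(1.0 / d)

# R_y(h) = D^{-1} Gamma_y(h) D^{-1}, cf. (2.1.44).
R = {h: D_inv @ G @ D_inv for h, G in Gamma.items()}
```

Each entry of `R[h]` is the corresponding entry of `Gamma[h]` divided by the product of two standard deviations, as in (2.1.45).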


2.2  Forecasting 

We have argued in the introduction that forecasting is one of the main objectives
of multiple time series analysis. Therefore, we will now discuss predictors
based on VAR processes. Point forecasts and interval forecasts will be considered
in turn. Before discussing particular predictors or forecasts (the two
terms  will  be  used  interchangeably)  we  comment  on  the  prediction  problem 
in  general. 

[Figure 2.4: panels of the autocorrelations $\rho(\cdot_t, \cdot_{t-h})$ for the pairs of investment, income, and consumption, plotted against $h = 0, 1, \ldots, 7$.]

Fig. 2.4. Autocorrelations of the investment/income/consumption system.


2.2.1  The  Loss  Function 

The forecaster usually finds himself in a situation where in a particular period
$t$ he has to make statements about the future values of variables $y_1, \ldots, y_K$.
For this purpose he has available a model for the data generation process and
an information set, say $\Omega_t$, containing the available information in period $t$.
The data generation process may, for instance, be a VAR(p) process and $\Omega_t$
may contain the past and present variables of the system under consideration,
that is, $\Omega_t = \{y_s \mid s \leq t\}$, where $y_s = (y_{1s}, \ldots, y_{Ks})'$. The period $t$, where the
forecast is made, is the forecast origin and the number of periods into the
future for which a forecast is desired is the forecast horizon. A predictor, $h$
periods ahead, is an $h$-step predictor.

If forecasts are desired for a particular purpose, a specific cost function
may be associated with the forecast errors. A forecast will be optimal if it
minimizes the cost. To find a forecast that is optimal in this sense is usually
too ambitious a goal to be attainable in practice. Therefore, minimizing the
expected cost or loss is often used as an objective. In general, which forecast
is optimal will depend on the particular loss function. On the other hand,
forecasts of economic variables are often published for general use. In that case,
the specific cost or loss functions of all potential users cannot be taken into
account in computing a forecast. In this situation, the statistical properties
of the forecasts and perhaps interval forecasts are of interest to enable the
user to draw proper conclusions for his or her particular needs. It may also
be desirable to choose the forecast such that it minimizes a wide range of
plausible loss functions.

In the context of VAR models, predictors that minimize the forecast mean
squared errors (MSEs) are the most widely used ones. Arguments in favor of
using the MSE as loss function are given by Granger (1969b) and Granger
& Newbold (1986). They show that minimum MSE forecasts also minimize a
range of loss functions other than the MSE. Moreover, for many loss functions
the optimal predictors are simple functions of minimum MSE predictors.
Furthermore, for an unbiased predictor, the MSE is the forecast error variance
which is useful in setting up interval forecasts. Therefore, minimum MSE
predictors will be of major interest in the following. If not otherwise stated, the
information set $\Omega_t$ is assumed to contain the variables of the system under
consideration up to and including period $t$.


2.2.2  Point  Forecasts 
Conditional  Expectation 

Suppose $y_t = (y_{1t}, \dots, y_{Kt})'$ is a $K$-dimensional stable VAR($p$) process as in
(2.1.1). Then, the minimum MSE predictor for forecast horizon $h$ at forecast
origin $t$ is the conditional expected value

$$E_t(y_{t+h}) := E(y_{t+h} \mid \Omega_t) = E(y_{t+h} \mid \{y_s \mid s \le t\}). \tag{2.2.1}$$

This predictor minimizes the MSE of each component of $y_t$. In other words,
if $y_t(h)$ is any $h$-step predictor at origin $t$,

$$\begin{aligned}
\mathrm{MSE}[y_t(h)] &= E[(y_{t+h} - y_t(h))(y_{t+h} - y_t(h))'] \\
&\ge \mathrm{MSE}[E_t(y_{t+h})] = E[(y_{t+h} - E_t(y_{t+h}))(y_{t+h} - E_t(y_{t+h}))'],
\end{aligned} \tag{2.2.2}$$

where the inequality sign $\ge$ between two matrices means that the difference
between the left-hand and the right-hand matrix is positive semidefinite.
Equivalently, for any $(K \times 1)$ vector $c$,

$$\mathrm{MSE}[c'y_t(h)] \ge \mathrm{MSE}[c'E_t(y_{t+h})].$$


The optimality of the conditional expectation can be seen by noting that

$$\begin{aligned}
\mathrm{MSE}[y_t(h)] &= E\{[y_{t+h} - E_t(y_{t+h}) + E_t(y_{t+h}) - y_t(h)] \\
&\qquad \times [y_{t+h} - E_t(y_{t+h}) + E_t(y_{t+h}) - y_t(h)]'\} \\
&= \mathrm{MSE}[E_t(y_{t+h})] + E\{[E_t(y_{t+h}) - y_t(h)][E_t(y_{t+h}) - y_t(h)]'\},
\end{aligned}$$

where $E\{[y_{t+h} - E_t(y_{t+h})][E_t(y_{t+h}) - y_t(h)]'\} = 0$ has been used. The latter
result holds because $[y_{t+h} - E_t(y_{t+h})]$ is a function of innovations after period
$t$ which are uncorrelated with the terms contained in $[E_t(y_{t+h}) - y_t(h)]$, which
are functions of $y_s$, $s \le t$.

The optimality of the conditional expectation implies that

$$E_t(y_{t+h}) = \nu + A_1 E_t(y_{t+h-1}) + \cdots + A_p E_t(y_{t+h-p}) \tag{2.2.3}$$

is the optimal $h$-step predictor of a VAR($p$) process $y_t$, provided $u_t$ is
independent white noise so that $u_t$ and $u_s$ are independent for $s \ne t$ and, hence,
$E_t(u_{t+h}) = 0$ for $h > 0$.

The formula (2.2.3) can be used for recursively computing the $h$-step
predictors starting with $h = 1$:

$$\begin{aligned}
E_t(y_{t+1}) &= \nu + A_1 y_t + \cdots + A_p y_{t-p+1}, \\
E_t(y_{t+2}) &= \nu + A_1 E_t(y_{t+1}) + A_2 y_t + \cdots + A_p y_{t-p+2}, \\
&\;\;\vdots
\end{aligned}$$

By these recursions we get for a VAR(1) process

$$E_t(y_{t+h}) = (I_K + A_1 + \cdots + A_1^{h-1})\nu + A_1^h y_t.$$

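Assuming $y_t = (-6, 3, 5)'$ and $\nu = (0, 2, 1)'$, the following forecasts are
obtained for the VAR(1) example process (2.1.14):

$$E_t(y_{t+1}) = \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix} + \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix} \begin{bmatrix} -6 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} -3.0 \\ 3.2 \\ 3.1 \end{bmatrix}, \tag{2.2.4a}$$

$$E_t(y_{t+2}) = (I_3 + A_1)\nu + A_1^2 y_t = \begin{bmatrix} -1.50 \\ 2.95 \\ 2.57 \end{bmatrix}, \tag{2.2.4b}$$

etc. Similarly, we get for the VAR(2) process (2.1.15) with $\nu = (.02, .03)'$,
$y_t = (.06, .03)'$ and $y_{t-1} = (.055, .03)'$,

$$E_t(y_{t+1}) = \begin{bmatrix} .02 \\ .03 \end{bmatrix} + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} \begin{bmatrix} .06 \\ .03 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} \begin{bmatrix} .055 \\ .03 \end{bmatrix} = \begin{bmatrix} .053 \\ .08275 \end{bmatrix},$$

$$E_t(y_{t+2}) = \begin{bmatrix} .02 \\ .03 \end{bmatrix} + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} \begin{bmatrix} .053 \\ .08275 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} \begin{bmatrix} .06 \\ .03 \end{bmatrix} = \begin{bmatrix} .0548 \\ .1076 \end{bmatrix}. \tag{2.2.5}$$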
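The recursion (2.2.3) is easy to program. The following Python/NumPy lines are a minimal sketch, not part of the original text; the function name `var_forecast` and its argument layout are our own choices for illustration. It reproduces the forecasts in (2.2.4a), (2.2.4b) and (2.2.5).

```python
import numpy as np

def var_forecast(nu, A_list, history, h):
    """Return [E_t(y_{t+1}), ..., E_t(y_{t+h})] from the recursion (2.2.3).

    nu      : intercept vector, shape (K,)
    A_list  : coefficient matrices [A_1, ..., A_p], each (K, K)
    history : [y_t, y_{t-1}, ..., y_{t-p+1}], most recent observation first
    """
    nu = np.asarray(nu, dtype=float)
    past = [np.asarray(y, dtype=float) for y in history]
    forecasts = []
    for _ in range(h):
        y_next = nu + sum(A @ y for A, y in zip(A_list, past))
        forecasts.append(y_next)
        # The new forecast takes the place of the not yet observed value.
        past = [y_next] + past[:-1]
    return forecasts

# VAR(1) example process (2.1.14)
A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
f_var1 = var_forecast([0, 2, 1], [A1], [[-6, 3, 5]], h=2)

# VAR(2) example process (2.1.15)
B1 = np.array([[0.5, 0.1], [0.4, 0.5]])
B2 = np.array([[0.0, 0.0], [0.25, 0.0]])
f_var2 = var_forecast([0.02, 0.03], [B1, B2],
                      [[0.06, 0.03], [0.055, 0.03]], h=2)

print(f_var1)
print(f_var2)
```

Replacing unknown future values by their forecasts inside the loop is exactly the step from $E_t(y_{t+1})$ to $E_t(y_{t+2})$ in the recursions above.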


The conditional expectation has the following properties:

(1) It is an unbiased predictor, that is, $E[y_{t+h} - E_t(y_{t+h})] = 0$.

(2) If $u_t$ is independent white noise, $\mathrm{MSE}[E_t(y_{t+h})] = \mathrm{MSE}[E_t(y_{t+h}) \mid y_t, y_{t-1}, \dots]$,
that is, the MSE of the predictor equals the conditional MSE given
$y_t, y_{t-1}, \dots$.

The latter property follows by similar arguments as the optimality of the
predictor $E_t(y_{t+h})$.

It must be emphasized that the prediction formula (2.2.3) relies on $u_t$
being independent white noise. If $u_t$ and $u_s$ are not independent but just
uncorrelated, $E_t(u_{t+h})$ will be nonzero in general. As an example consider the
univariate AR(1) process $y_t = \nu + \alpha y_{t-1} + u_t$ with

$$u_t = \begin{cases} e_t & \text{for } t = 0, \pm 2, \pm 4, \dots, \\ (e_{t-1}^2 - 1)/\sqrt{2} & \text{for } t = \pm 1, \pm 3, \dots, \end{cases}$$

where the $e_t$ are independent standard normal ($\mathcal{N}(0, 1)$) random variables
(see also Fuller (1976, Chapter 2, Exercise 16)). The process $u_t$ is easily seen
to be uncorrelated but not independent white noise. For even $t$,

$$E_t(u_{t+1}) = E[(e_t^2 - 1)/\sqrt{2} \mid y_t, y_{t-1}, \dots] = (e_t^2 - 1)/\sqrt{2},$$

because $e_t = y_t - \nu - \alpha y_{t-1}$.
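The difference between uncorrelated and independent white noise in this example can be illustrated by simulation. The sketch below is our own illustration under stated assumptions (sample size and random seed are arbitrary choices); it generates $u_t$ as defined above and shows that the lag-one sample autocorrelation is close to zero although, for even $t$, $u_{t+1}$ is an exact function of $u_t$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n = 200_000                      # even number of periods t = 0, ..., n-1
e = rng.standard_normal(n)

u = np.empty(n)
u[0::2] = e[0::2]                              # even t: u_t = e_t
u[1::2] = (e[0::2] ** 2 - 1) / np.sqrt(2)      # odd t:  u_t = (e_{t-1}^2 - 1)/sqrt(2)

# Sample correlation between u_t and u_{t+1}: close to zero.
lag1_corr = np.corrcoef(u[:-1], u[1:])[0, 1]

# For even t the next value is perfectly predictable from u_t = e_t,
# so the conditional expectation of u_{t+1} is not zero.
predictable = np.allclose(u[1::2], (u[0::2] ** 2 - 1) / np.sqrt(2))

print(lag1_corr, predictable)
```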


Linear  Minimum  MSE  Predictor 

If $u_t$ is not independent white noise, additional assumptions are usually
required to find the optimal predictor (conditional expectation) of a VAR($p$)
process. Without such assumptions we can achieve the less ambitious goal of
finding the minimum MSE predictors among those that are linear functions
of $y_t, y_{t-1}, \dots$. Let us consider a zero mean VAR(1) process

$$y_t = A_1 y_{t-1} + u_t \tag{2.2.6}$$

first. As in (2.1.3), it follows that

$$y_{t+h} = A_1^h y_t + \sum_{i=0}^{h-1} A_1^i u_{t+h-i}.$$


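Thus, for a predictor

$$y_t(h) = B_0 y_t + B_1 y_{t-1} + \cdots,$$

where the $B_i$'s are $(K \times K)$ coefficient matrices, we get a forecast error

$$y_{t+h} - y_t(h) = \sum_{i=0}^{h-1} A_1^i u_{t+h-i} + (A_1^h - B_0) y_t - \sum_{i=1}^{\infty} B_i y_{t-i}.$$

Using that $u_{t+j}$, for $j > 0$, is uncorrelated with $y_{t-i}$, for $i \ge 0$, we get

$$\begin{aligned}
\mathrm{MSE}[y_t(h)] ={} & E\left[\left(\sum_{i=0}^{h-1} A_1^i u_{t+h-i}\right)\left(\sum_{i=0}^{h-1} A_1^i u_{t+h-i}\right)'\right] \\
& + E\left[\left((A_1^h - B_0) y_t - \sum_{i=1}^{\infty} B_i y_{t-i}\right)\left((A_1^h - B_0) y_t - \sum_{i=1}^{\infty} B_i y_{t-i}\right)'\right].
\end{aligned}$$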

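Obviously, this MSE matrix is minimal for $B_0 = A_1^h$ and $B_i = 0$ for $i > 0$. Thus, the
optimal (linear minimum MSE) predictor for this special case is

$$y_t(h) = A_1^h y_t = A_1 y_t(h-1).$$

The forecast error is

$$\sum_{i=0}^{h-1} A_1^i u_{t+h-i}$$

and the MSE or forecast error covariance matrix is

$$\begin{aligned}
\Sigma_y(h) := \mathrm{MSE}[y_t(h)] &= E\left[\left(\sum_{i=0}^{h-1} A_1^i u_{t+h-i}\right)\left(\sum_{i=0}^{h-1} A_1^i u_{t+h-i}\right)'\right] \\
&= \sum_{i=0}^{h-1} A_1^i \Sigma_u (A_1^i)' = \mathrm{MSE}[y_t(h-1)] + A_1^{h-1} \Sigma_u (A_1^{h-1})'.
\end{aligned}$$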

A general VAR($p$) process with zero mean,

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t,$$

has a VAR(1) counterpart,

$$Y_t = \mathbf{A} Y_{t-1} + U_t,$$

where $Y_t$, $\mathbf{A}$, and $U_t$ are as defined in (2.1.8). Using the same arguments as
above, the optimal predictor of $Y_{t+h}$ is seen to be

$$Y_t(h) = \mathbf{A}^h Y_t = \mathbf{A} Y_t(h-1).$$

It is easily seen by induction with respect to $h$ that

$$Y_t(h) = \begin{bmatrix} y_t(h) \\ y_t(h-1) \\ \vdots \\ y_t(h-p+1) \end{bmatrix},$$

where $y_t(j) := y_{t+j}$ for $j \le 0$. Defining the $(K \times Kp)$ matrix $J := [I_K : 0 : \cdots : 0]$
as in (2.1.11), we get the optimal $h$-step predictor of the process $y_t$ at
origin $t$ as

$$\begin{aligned}
y_t(h) &= J\mathbf{A}Y_t(h-1) = [A_1, \dots, A_p]\,Y_t(h-1) \\
&= A_1 y_t(h-1) + \cdots + A_p y_t(h-p).
\end{aligned} \tag{2.2.7}$$

This formula may be used for recursively computing the forecasts. Obviously,
$y_t(h)$ is the conditional expectation $E_t(y_{t+h})$ if $u_t$ is independent white noise
because the recursion in (2.2.3) is the same as the one obtained here for a
zero mean process with $\nu = 0$.
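The equivalence of the companion form predictor $J\mathbf{A}^h Y_t$ and the recursion (2.2.7) can be checked numerically. The following sketch is our own illustration: it uses the coefficient matrices of the VAR(2) example (2.1.15) with the intercept set to zero, and the helper name `companion` is hypothetical.

```python
import numpy as np

def companion(A_list):
    """Stack [A_1, ..., A_p] into the (Kp x Kp) companion matrix of (2.1.8)."""
    K, p = A_list[0].shape[0], len(A_list)
    top = np.hstack(A_list)                 # [A_1 : ... : A_p]
    bottom = np.eye(K * (p - 1), K * p)     # [I_{K(p-1)} : 0]
    return np.vstack([top, bottom])

A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
K, p = 2, 2
A = companion([A1, A2])
J = np.eye(K, K * p)                        # J = [I_K : 0]

y_t, y_tm1 = np.array([0.06, 0.03]), np.array([0.055, 0.03])
Y_t = np.concatenate([y_t, y_tm1])

# Companion form: y_t(h) = J A^h Y_t
y1_comp = J @ A @ Y_t
y2_comp = J @ np.linalg.matrix_power(A, 2) @ Y_t

# Recursion (2.2.7) with y_t(0) = y_t and y_t(-1) = y_{t-1}
y1_rec = A1 @ y_t + A2 @ y_tm1
y2_rec = A1 @ y1_rec + A2 @ y_t

print(y2_comp, y2_rec)
```

In practice the recursion is cheaper than forming powers of the $(Kp \times Kp)$ matrix; the companion form is mainly convenient for derivations.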

If the process $y_t$ has nonzero mean, that is,

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t,$$

we define $x_t := y_t - \mu$, where $\mu := E(y_t) = (I - A_1 - \cdots - A_p)^{-1}\nu$. The process
$x_t$ has zero mean and the optimal $h$-step predictor is

$$x_t(h) = A_1 x_t(h-1) + \cdots + A_p x_t(h-p).$$

Adding $\mu$ to both sides of this equation gives the optimal linear predictor of
$y_t$,

$$\begin{aligned}
y_t(h) = x_t(h) + \mu &= \mu + A_1(y_t(h-1) - \mu) + \cdots + A_p(y_t(h-p) - \mu) \\
&= \nu + A_1 y_t(h-1) + \cdots + A_p y_t(h-p).
\end{aligned} \tag{2.2.8}$$

Henceforth, we will refer to $y_t(h)$ as the optimal predictor irrespective of the
properties of the white noise process $u_t$, that is, even if $u_t$ is not independent
but just uncorrelated white noise.
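As a small numerical check of (2.2.8), the following sketch (our own illustration, using the VAR(2) example values from above) computes the forecasts once in mean-adjusted form and once in intercept form.

```python
import numpy as np

A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
nu = np.array([0.02, 0.03])
y_t, y_tm1 = np.array([0.06, 0.03]), np.array([0.055, 0.03])

# Process mean mu = (I - A_1 - A_2)^{-1} nu
mu = np.linalg.solve(np.eye(2) - A1 - A2, nu)

# Mean-adjusted form: forecast x_t = y_t - mu, then add mu back
x1 = A1 @ (y_t - mu) + A2 @ (y_tm1 - mu)
x2 = A1 @ x1 + A2 @ (y_t - mu)
y1_mean_adjusted, y2_mean_adjusted = x1 + mu, x2 + mu

# Intercept form, second line of (2.2.8)
y1_intercept = nu + A1 @ y_t + A2 @ y_tm1
y2_intercept = nu + A1 @ y1_intercept + A2 @ y_t

print(mu, y2_mean_adjusted, y2_intercept)
```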

Using

$$Y_{t+h} = \mathbf{A}^h Y_t + \sum_{i=0}^{h-1} \mathbf{A}^i U_{t+h-i}$$

for a zero mean process, we get the forecast error

$$\begin{aligned}
y_{t+h} - y_t(h) &= J[Y_{t+h} - Y_t(h)] = J\left[\sum_{i=0}^{h-1} \mathbf{A}^i U_{t+h-i}\right] \\
&= \sum_{i=0}^{h-1} J\mathbf{A}^i J' J U_{t+h-i} = \sum_{i=0}^{h-1} \Phi_i u_{t+h-i},
\end{aligned} \tag{2.2.9}$$

where the $\Phi_i$ are the MA coefficient matrices from (2.1.17). The forecast error
is unchanged if $y_t$ has nonzero mean because the mean term cancels. The
forecast error representation (2.2.9) shows that the predictor $y_t(h)$ can also
be expressed in terms of the MA representation (2.1.17),

$$y_t(h) = \mu + \sum_{i=h}^{\infty} \Phi_i u_{t+h-i} = \mu + \sum_{i=0}^{\infty} \Phi_{h+i} u_{t-i}. \tag{2.2.10}$$

From (2.2.9) the forecast error covariance or MSE matrix is easy to obtain,

$$\Sigma_y(h) := \mathrm{MSE}[y_t(h)] = \sum_{i=0}^{h-1} \Phi_i \Sigma_u \Phi_i' = \Sigma_y(h-1) + \Phi_{h-1} \Sigma_u \Phi_{h-1}'. \tag{2.2.11}$$

Hence, the MSEs are monotonically nondecreasing and, for $h \to \infty$, the MSE
matrices approach the covariance matrix of $y_t$,

$$\Gamma_y(0) = \Sigma_y = \sum_{i=0}^{\infty} \Phi_i \Sigma_u \Phi_i'$$

(see (2.1.18)). That is,

$$\Sigma_y(h) \xrightarrow[h \to \infty]{} \Sigma_y. \tag{2.2.12}$$

If the process mean $\mu$ is used as a forecast, the MSE matrix of that predictor
is just the covariance matrix $\Sigma_y$ of $y_t$. Hence, the optimal long range forecast
($h \to \infty$) is the process mean. In other words, the past of the process contains
no information on the development of the process in the distant future. Zero
mean processes with this property are purely nondeterministic, that is, $y_t - \mu$
is purely nondeterministic if the forecast MSEs satisfy (2.2.12).

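For the example VAR(1) process (2.1.14) with $\Sigma_u$ as in (2.1.33), using the
MA coefficient matrices from (2.1.24), the forecast MSE matrices

$$\begin{aligned}
\Sigma_y(1) &= \Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix}, \\
\Sigma_y(2) &= \Sigma_u + \Phi_1 \Sigma_u \Phi_1' = \begin{bmatrix} 2.813 & .113 & 0 \\ .113 & 1.129 & .632 \\ 0 & .632 & .907 \end{bmatrix}, \\
\Sigma_y(3) &= \Sigma_y(2) + \Phi_2 \Sigma_u \Phi_2' = \begin{bmatrix} 2.953 & .146 & .011 \\ .146 & 1.161 & .663 \\ .011 & .663 & .943 \end{bmatrix}
\end{aligned} \tag{2.2.13}$$

are obtained. Similarly, for the VAR(2) example process (2.1.15) with white
noise covariance matrix (2.1.41), we get with $\Phi_1$ from (2.1.25),

$$\begin{aligned}
\Sigma_y(1) &= \Sigma_u = \begin{bmatrix} .09 & 0 \\ 0 & .04 \end{bmatrix}, \\
\Sigma_y(2) &= \Sigma_u + \Phi_1 \Sigma_u \Phi_1' = \begin{bmatrix} .1129 & .02 \\ .02 & .0644 \end{bmatrix}.
\end{aligned} \tag{2.2.14}$$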

2.2.3  Interval  Forecasts  and  Forecast  Regions 

In order to set up interval forecasts or forecast intervals, we make an
assumption about the distributions of the $y_t$ or the $u_t$. It is most common to consider
Gaussian processes where $y_t, y_{t+1}, \dots, y_{t+h}$ have a multivariate normal
distribution for any $t$ and $h$. Equivalently, it may be assumed that $u_t$ is Gaussian,
that is, the $u_t$ are multivariate normal, $u_t \sim \mathcal{N}(0, \Sigma_u)$, and $u_t$ and $u_s$ are
independent for $s \ne t$.

Under these conditions the forecast errors are also normally distributed as
linear transformations of normal vectors,

$$y_{t+h} - y_t(h) = \sum_{i=0}^{h-1} \Phi_i u_{t+h-i} \sim \mathcal{N}(0, \Sigma_y(h)). \tag{2.2.15}$$

This result implies that the forecast errors of the individual components are
normal so that

$$\frac{y_{k,t+h} - y_{k,t}(h)}{\sigma_k(h)} \sim \mathcal{N}(0, 1), \tag{2.2.16}$$

where $y_{k,t}(h)$ is the $k$-th component of $y_t(h)$ and $\sigma_k(h)$ is the square root of the
$k$-th diagonal element of $\Sigma_y(h)$. Denoting by $z_{(\alpha)}$ the upper $\alpha 100$ percentage
point of the normal distribution, we get

$$\begin{aligned}
1 - \alpha &= \Pr\left\{-z_{(\alpha/2)} \le \frac{y_{k,t+h} - y_{k,t}(h)}{\sigma_k(h)} \le z_{(\alpha/2)}\right\} \\
&= \Pr\left\{y_{k,t}(h) - z_{(\alpha/2)}\sigma_k(h) \le y_{k,t+h} \le y_{k,t}(h) + z_{(\alpha/2)}\sigma_k(h)\right\}.
\end{aligned}$$

Hence, a $(1-\alpha)100\%$ interval forecast, $h$ periods ahead, for the $k$-th component
of $y_t$ is

$$y_{k,t}(h) \pm z_{(\alpha/2)}\sigma_k(h) \tag{2.2.17a}$$

or

$$[\,y_{k,t}(h) - z_{(\alpha/2)}\sigma_k(h),\; y_{k,t}(h) + z_{(\alpha/2)}\sigma_k(h)\,]. \tag{2.2.17b}$$

If forecast intervals of this type are computed repeatedly from a large number
of time series (realizations of the considered process), then about $(1-\alpha)100\%$
of the intervals will contain the actual value of the random variable $y_{k,t+h}$.

Using (2.2.4a), (2.2.4b) and (2.2.13), 95% forecast intervals for the
components of the example VAR(1) process (2.1.14) are

$$\begin{aligned}
y_{1,t}(1) \pm 1.96\sqrt{2.25} \quad &\text{or} \quad -3.0 \pm 2.94, \\
y_{2,t}(1) \pm 1.96\sqrt{1.0} \quad &\text{or} \quad 3.2 \pm 1.96, \\
y_{3,t}(1) \pm 1.96\sqrt{.74} \quad &\text{or} \quad 3.1 \pm 1.69, \\
y_{1,t}(2) \pm 1.96\sqrt{2.813} \quad &\text{or} \quad -1.50 \pm 3.29, \\
y_{2,t}(2) \pm 1.96\sqrt{1.129} \quad &\text{or} \quad 2.95 \pm 2.08, \\
y_{3,t}(2) \pm 1.96\sqrt{.907} \quad &\text{or} \quad 2.57 \pm 1.87.
\end{aligned} \tag{2.2.18}$$

The result in (2.2.15) can also be used to establish joint forecast regions
for two or more variables. For instance, if a joint forecast region for the first
$N$ components is desired, we define the $(N \times K)$ matrix $F := [I_N : 0]$ and
note that

$$[y_{t+h} - y_t(h)]'F'[F\Sigma_y(h)F']^{-1}F[y_{t+h} - y_t(h)] \sim \chi^2(N) \tag{2.2.19}$$

by a well-known result for multivariate normal vectors (see Appendix B).
Hence, the $\chi^2(N)$-distribution can be used to determine a $(1-\alpha)100\%$ forecast
ellipsoid for the first $N$ components of the process.
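For $N = 2$ the check whether a point lies inside the forecast ellipse is particularly simple because the upper $\alpha$ quantile of the $\chi^2(2)$-distribution has the closed form $-2\ln\alpha$. The following sketch is our own illustration, restricted to this case; the function name and the two trial outcomes are hypothetical, while the forecast and the MSE matrix are taken from (2.2.4a) and (2.2.13).

```python
import numpy as np

def in_forecast_ellipse(y_future, y_forecast, mse, alpha=0.05):
    """Evaluate the quadratic form (2.2.19) for the first N = 2 components.

    Returns (inside, statistic). The critical value -2 ln(alpha) is the
    upper alpha quantile of chi^2(2); other N need a general quantile function.
    """
    K = len(y_forecast)
    F = np.eye(2, K)                                  # F = [I_2 : 0]
    err = F @ (np.asarray(y_future, float) - np.asarray(y_forecast, float))
    stat = err @ np.linalg.solve(F @ mse @ F.T, err)
    crit = -2.0 * np.log(alpha)
    return stat <= crit, stat

forecast = np.array([-3.0, 3.2, 3.1])                 # y_t(1) from (2.2.4a)
mse_1 = np.array([[2.25, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 0.74]])

inside_a, stat_a = in_forecast_ellipse([-1.0, 3.0, 0.0], forecast, mse_1)
inside_b, stat_b = in_forecast_ellipse([1.0, 3.2, 3.1], forecast, mse_1)

print(inside_a, stat_a)
print(inside_b, stat_b)
```

Only the first two components of the trial outcomes enter the statistic; the third is ignored because of the selection matrix $F$.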

In practice, the construction of the ellipsoid is quite demanding if $N$ is
greater than two or three. Therefore, a more practical approach is to use
Bonferroni's method for constructing joint confidence regions. It is based on
the fact that for events $E_1, \dots, E_N$ the following probability inequality holds:

$$\Pr(E_1 \cup \cdots \cup E_N) \le \Pr(E_1) + \cdots + \Pr(E_N).$$

Hence,

$$\Pr\left(\bigcap_{i=1}^{N} E_i\right) \ge 1 - \sum_{i=1}^{N} \Pr(\bar{E}_i),$$

where $\bar{E}_i$ denotes the complement of $E_i$. Consequently, if $E_i$ is the event that
$y_{i,t+h}$ falls within an interval $H_i$,

$$\Pr(Fy_{t+h} \in H_1 \times \cdots \times H_N) \ge 1 - \sum_{i=1}^{N} \Pr(\bar{E}_i). \tag{2.2.20}$$

In other words, if we choose a $(1 - \frac{\alpha}{N})100\%$ forecast interval for each of the
$N$ components, the resulting joint forecast region has probability at least
$(1-\alpha)100\%$ of containing all $N$ variables jointly. For instance, for the VAR(1)
example process considered previously,

$$\{(y_1, y_2) \mid -3.0 - 2.94 \le y_1 \le -3.0 + 2.94,\; 3.2 - 1.96 \le y_2 \le 3.2 + 1.96\}$$

is a joint forecast region of $(y_{1,t+1}, y_{2,t+1})$ with probability content at least
90%.

By the same method joint forecast regions for different horizons $h$ can be
obtained. For instance, a joint forecast region with probability content of at
least $(1-\alpha)100\%$ for $y_{k,t+1}, \dots, y_{k,t+h}$ is

$$\{(y_{k,1}, \dots, y_{k,h}) \mid y_{k,t}(i) - z_{(\alpha/2h)}\sigma_k(i) \le y_{k,i} \le y_{k,t}(i) + z_{(\alpha/2h)}\sigma_k(i),\; i = 1, \dots, h\}. \tag{2.2.21}$$

Thus, for the example, a joint forecast region for $y_{2,t+1}, y_{2,t+2}$ with probability
content of at least 90% is given by

$$\{(y_{2,1}, y_{2,2}) \mid 1.24 \le y_{2,1} \le 5.16,\; .87 \le y_{2,2} \le 5.03\}.$$
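The Bonferroni regions are again straightforward to compute. In the sketch below, which is our own illustration with a hypothetical function name, each of the $n$ intervals is given individual coverage $1 - \alpha/n$, so that the Cartesian product has probability content of at least $1 - \alpha$ by (2.2.20). The inputs are the point forecasts and MSEs of the VAR(1) example.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_intervals(points, std_devs, alpha=0.10):
    """Rows [lower, upper]; the product of the intervals has probability
    content of at least 1 - alpha under Gaussian forecast errors."""
    n = len(points)
    z = NormalDist().inv_cdf(1 - alpha / (2 * n))   # z_(alpha/2n)
    points = np.asarray(points, dtype=float)
    half = z * np.asarray(std_devs, dtype=float)
    return np.column_stack([points - half, points + half])

# Joint region for (y_{1,t+1}, y_{2,t+1}): forecasts (2.2.4a), variances from Sigma_y(1)
region_components = bonferroni_intervals([-3.0, 3.2], np.sqrt([2.25, 1.0]))

# Joint region for (y_{2,t+1}, y_{2,t+2}) as in (2.2.21): variances from (2.2.13)
region_horizons = bonferroni_intervals([3.2, 2.95], np.sqrt([1.0, 1.129]))

print(np.round(region_components, 2))
print(np.round(region_horizons, 2))
```

With $\alpha = .10$ and two intervals, $z_{(\alpha/4)} \approx 1.96$, which is why the 95% intervals from (2.2.18) reappear in the 90% joint regions.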

Under our assumption of a Gaussian process, the distribution of the
forecasts and forecast errors is known and, consequently, forecast intervals are
easy to set up. If the underlying process has a different and potentially
unknown distribution, considering the forecast distribution becomes more
difficult. Even then methods are available to determine more than just point
forecasts. A survey of density forecasting is given by Tay & Wallis (2002).


2.3  Structural  Analysis  with  VAR  Models 

Because  VAR  models  represent  the  correlations  among  a  set  of  variables,  they 
are  often  used  to  analyze  certain  aspects  of  the  relationships  between  the 
variables  of  interest.  In  the  following,  three  ways  to  interpret  a  VAR  model 
will  be  discussed.  They  are  all  closely  related  and  they  are  all  beset  with 
problems  that  will  be  pointed  out  subsequently. 


2.3.1 Granger-Causality, Instantaneous Causality, and Multi-Step Causality

Definitions  of  Causality 

Granger (1969a) has defined a concept of causality which, under suitable
conditions, is fairly easy to deal with in the context of VAR models. Therefore
it has become quite popular in recent years. The idea is that a cause cannot
come after the effect. Thus, if a variable $x$ affects a variable $z$, the former
should help to improve the predictions of the latter variable.

To formalize this idea, suppose that $\Omega_t$ is the information set containing all
the relevant information in the universe available up to and including period
$t$. Let $z_t(h \mid \Omega_t)$ be the optimal (minimum MSE) $h$-step predictor of the process
$z_t$ at origin $t$, based on the information in $\Omega_t$. The corresponding forecast MSE
will be denoted by $\Sigma_z(h \mid \Omega_t)$. The process $x_t$ is said to cause $z_t$ in Granger's
sense if

$$\Sigma_z(h \mid \Omega_t) < \Sigma_z(h \mid \Omega_t \setminus \{x_s \mid s \le t\}) \quad \text{for at least one } h = 1, 2, \dots. \tag{2.3.1}$$

Alternatively, we will say that $x_t$ Granger-causes (or briefly causes) $z_t$ or $x_t$ is
Granger-causal for $z_t$ if (2.3.1) holds. In (2.3.1) $\Omega_t \setminus \{x_s \mid s \le t\}$ is the set
containing all the relevant information in the universe except for the information
in the past and present of the $x_t$ process. In other words, if $z_t$ can be predicted
more efficiently if the information in the $x_t$ process is taken into account in
addition to all other information in the universe, then $x_t$ is Granger-causal
for $z_t$.

The definition extends immediately to the case where $z_t$ and $x_t$ are $M$-
and $N$-dimensional processes, respectively. In that case, $x_t$ is said to
Granger-cause $z_t$ if

$$\Sigma_z(h \mid \Omega_t) \ne \Sigma_z(h \mid \Omega_t \setminus \{x_s \mid s \le t\}) \tag{2.3.2}$$

for some $t$ and $h$. Alternatively, this could be expressed by requiring the two
MSEs to be different and

$$\Sigma_z(h \mid \Omega_t) \le \Sigma_z(h \mid \Omega_t \setminus \{x_s \mid s \le t\})$$

(i.e., the difference between the right-hand and the left-hand matrix is
positive semidefinite). Because the null matrix is also positive semidefinite, it is
necessary to require in addition that the two matrices are not identical. If $x_t$
causes $z_t$ and $z_t$ also causes $x_t$ the process $(z_t', x_t')'$ is called a feedback system.

Sometimes the term "instantaneous causality" is used in economic
analyses. We say that there is instantaneous causality between $z_t$ and $x_t$ if

$$\Sigma_z(1 \mid \Omega_t \cup \{x_{t+1}\}) \ne \Sigma_z(1 \mid \Omega_t). \tag{2.3.3}$$

In other words, in period $t$, adding $x_{t+1}$ to the information set helps to improve
the forecast of $z_{t+1}$. We will see shortly that this concept of causality is really
symmetric, that is, if there is instantaneous causality between $z_t$ and $x_t$, then
there is also instantaneous causality between $x_t$ and $z_t$ (see Proposition 2.3).
Therefore we do not use the notion "instantaneous causality from $x_t$ to $z_t$" in
the foregoing definition.

A possible criticism of the foregoing definitions could relate to the choice
of the MSE as a measure of the forecast precision. Of course, the choice of
another measure could lead to a different definition of causality. However,
in the situations of interest in the following, equality of the MSEs will
imply equality of the corresponding predictors. In that case a process $z_t$ is not
Granger-caused by $x_t$ if the optimal predictor of $z_t$ does not use information
from the $x_t$ process. This result is intuitively appealing.

A more serious practical problem is the choice of the information set $\Omega_t$.
Usually all the relevant information in the universe is not available to a
forecaster and, thus, the optimal predictor given $\Omega_t$ cannot be determined.
Therefore a less demanding definition of causality is often used in practice. Instead
of all the information in the universe, only the information in the past and
present of the process under study is considered relevant and $\Omega_t$ is replaced
by $\{z_s, x_s \mid s \le t\}$. Furthermore, instead of optimal predictors, optimal linear
predictors are compared. In other words, $z_t(h \mid \Omega_t)$ is replaced by the linear
minimum MSE $h$-step predictor based on the information in $\{z_s, x_s \mid s \le t\}$ and
$z_t(h \mid \Omega_t \setminus \{x_s \mid s \le t\})$ is replaced by the linear minimum MSE $h$-step predictor
based on $\{z_s \mid s \le t\}$. In the following, when the terms "Granger-causality"
and "instantaneous causality" are used, these restrictive assumptions are
implicitly used if not otherwise noted.


Characterization  of  Granger-Causality 

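In order to determine the Granger-causal relationships between the variables
of the $K$-dimensional VAR process $y_t$, suppose it has the canonical MA
representation

$$y_t = \mu + \sum_{i=0}^{\infty} \Phi_i u_{t-i} = \mu + \Phi(L)u_t, \qquad \Phi_0 = I_K, \tag{2.3.4}$$

where $u_t$ is a white noise process with nonsingular covariance matrix $\Sigma_u$.
Suppose that $y_t$ consists of the $M$-dimensional process $z_t$ and the $(K - M)$-dimensional
process $x_t$ and the MA representation is partitioned accordingly,

$$y_t = \begin{bmatrix} z_t \\ x_t \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} \Phi_{11}(L) & \Phi_{12}(L) \\ \Phi_{21}(L) & \Phi_{22}(L) \end{bmatrix} \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}. \tag{2.3.5}$$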


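Using the prediction formula (2.2.10), the optimal 1-step forecast of $z_t$ based
on $y_t$ is

$$\begin{aligned}
z_t(1 \mid \{y_s \mid s \le t\}) &= [I_M : 0]\,y_t(1) \\
&= \mu_1 + \sum_{i=1}^{\infty} \Phi_{11,i} u_{1,t+1-i} + \sum_{i=1}^{\infty} \Phi_{12,i} u_{2,t+1-i}.
\end{aligned} \tag{2.3.6}$$

Hence the forecast error is

$$z_{t+1} - z_t(1 \mid \{y_s \mid s \le t\}) = u_{1,t+1}. \tag{2.3.7}$$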

As mentioned in Section 2.1.3, a subprocess of a stationary process also
has a prediction error MA representation. Thus,

$$\begin{aligned}
z_t &= \mu_1 + \sum_{i=0}^{\infty} \Phi_{11,i} u_{1,t-i} + \sum_{i=1}^{\infty} \Phi_{12,i} u_{2,t-i} \\
&= \mu_1 + \sum_{i=0}^{\infty} F_i v_{t-i},
\end{aligned} \tag{2.3.8}$$

where $F_0 = I_M$ and the last expression is a prediction error MA
representation. Thus, the optimal 1-step predictor based on $z_t$ only is

$$z_t(1 \mid \{z_s \mid s \le t\}) = \mu_1 + \sum_{i=1}^{\infty} F_i v_{t+1-i} \tag{2.3.9}$$

and the corresponding forecast error is

$$z_{t+1} - z_t(1 \mid \{z_s \mid s \le t\}) = v_{t+1}. \tag{2.3.10}$$

Consequently, the predictors (2.3.6) and (2.3.9) are identical if and only if
$v_t = u_{1,t}$ for all $t$. In other words, equality of the predictors is equivalent to
$z_t$ having the MA representation

$$\begin{aligned}
z_t &= \mu_1 + \sum_{i=0}^{\infty} F_i u_{1,t-i} = \mu_1 + \sum_{i=0}^{\infty} [F_i : 0]\,u_{t-i} \\
&= \mu_1 + \sum_{i=0}^{\infty} [\Phi_{11,i} : \Phi_{12,i}]\,u_{t-i} \\
&= \mu_1 + \sum_{i=0}^{\infty} \Phi_{11,i} u_{1,t-i} + \sum_{i=1}^{\infty} \Phi_{12,i} u_{2,t-i}.
\end{aligned}$$

Uniqueness of the canonical MA representation implies that $F_i = \Phi_{11,i}$ and
$\Phi_{12,i} = 0$ for $i = 1, 2, \dots$. Hence, we get the following proposition.

Proposition 2.2 (Characterization of Granger-Noncausality)

Let $y_t$ be a VAR process as in (2.3.4)/(2.3.5) with canonical MA operator
$\Phi(z)$. Then

$$z_t(1 \mid \{y_s \mid s \le t\}) = z_t(1 \mid \{z_s \mid s \le t\}) \iff \Phi_{12,i} = 0 \text{ for } i = 1, 2, \dots. \tag{2.3.11}$$


Because we have just used the MA representation (2.3.4) and not its finite
order VAR form, the proposition is not only valid for VAR processes but
more generally for processes having a canonical MA representation such as
(2.3.4). From (2.2.10) it is obvious that equality of the 1-step predictors implies
equality of the $h$-step predictors for $h = 2, 3, \dots$. Hence, the proposition
provides a necessary and sufficient condition for $x_t$ being not Granger-causal
for $z_t$, that is, $z_t$ is not Granger-caused by $x_t$ if and only if $\Phi_{12,i} = 0$ for
$i = 1, 2, \dots$. Thus, Granger-noncausality can be checked easily by looking
at the MA representation of $y_t$. Because we are mostly concerned with VAR
processes, it is worth noting that for a stationary, stable VAR($p$) process

$$\begin{bmatrix} z_t \\ x_t \end{bmatrix} = \begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix} + \begin{bmatrix} A_{11,1} & A_{12,1} \\ A_{21,1} & A_{22,1} \end{bmatrix} \begin{bmatrix} z_{t-1} \\ x_{t-1} \end{bmatrix} + \cdots + \begin{bmatrix} A_{11,p} & A_{12,p} \\ A_{21,p} & A_{22,p} \end{bmatrix} \begin{bmatrix} z_{t-p} \\ x_{t-p} \end{bmatrix} + \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} \tag{2.3.12}$$

the condition (2.3.11) is satisfied if and only if

$$A_{12,i} = 0 \quad \text{for } i = 1, \dots, p.$$

This result follows from the recursions in (2.1.22) or, alternatively, because
the inverse of

$$\begin{bmatrix} \Phi_{11}(L) & 0 \\ \Phi_{21}(L) & \Phi_{22}(L) \end{bmatrix}$$

is

$$\begin{bmatrix} \Phi_{11}(L)^{-1} & 0 \\ -\Phi_{22}(L)^{-1}\Phi_{21}(L)\Phi_{11}(L)^{-1} & \Phi_{22}(L)^{-1} \end{bmatrix}.$$

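Thus, we have the following result.

Corollary 2.2.1

If $y_t$ is a stable VAR($p$) process as in (2.3.12) with nonsingular white noise
covariance matrix $\Sigma_u$, then

$$\begin{aligned}
& z_t(h \mid \{y_s \mid s \le t\}) = z_t(h \mid \{z_s \mid s \le t\}), \quad h = 1, 2, \dots \\
& \iff A_{12,i} = 0 \quad \text{for } i = 1, \dots, p.
\end{aligned} \tag{2.3.13}$$

Alternatively,

$$\begin{aligned}
& x_t(h \mid \{y_s \mid s \le t\}) = x_t(h \mid \{x_s \mid s \le t\}), \quad h = 1, 2, \dots \\
& \iff A_{21,i} = 0 \quad \text{for } i = 1, \dots, p.
\end{aligned} \tag{2.3.14}$$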


This corollary implies that noncausalities can be determined by just
looking at the VAR representation of the system. For instance, for the example
process (2.1.14),

$$\begin{bmatrix} y_{1,t} \\ y_{2,t} \\ y_{3,t} \end{bmatrix} = \nu + \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \\ y_{3,t-1} \end{bmatrix} + u_t,$$

$x_t := (y_{2t}, y_{3t})'$ does not Granger-cause $z_t := y_{1t}$ because $A_{12,1} = 0$ if the
coefficient matrix is partitioned according to (2.3.12). On the other hand, $z_t$
Granger-causes $x_t$. To give this discussion economic content let us assume
that the variables in the system are rates of change of investment ($y_1$),
income ($y_2$), and consumption ($y_3$). With these specifications, the previous
discussion shows that investment Granger-causes the consumption/income
system whereas the converse is not true. It is also easy to check that
consumption causes the income/investment system and vice versa. Note that so
far we have defined Granger-causality only in terms of two groups of
variables. Therefore, at this stage, we cannot talk about the Granger-causal
relationship between consumption and income in the three-dimensional
investment/income/consumption system.

Let us assume that the variables in the example VAR(2) process (2.1.15),

$$\begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix} = \nu + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} \begin{bmatrix} y_{1,t-2} \\ y_{2,t-2} \end{bmatrix} + u_t,$$

represent the inflation rate ($y_1$), and some interest rate ($y_2$). Using Corollary
2.2.1, it is immediately obvious that inflation causes the interest rate and vice
versa. Hence the system is a feedback system. In the following we will refer
to (2.1.15) as the inflation/interest rate system.
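Checking the zero restrictions of Corollary 2.2.1 amounts to inspecting blocks of the coefficient matrices. The following sketch is our own illustration with a hypothetical function name; the index lists for the two groups are assumed to cover all $K$ variables, as in the partition (2.3.12). For the VAR(1) example it also confirms the equivalent MA condition $\Phi_{12,i} = 0$ of Proposition 2.2, using $\Phi_i = A_1^i$.

```python
import numpy as np

def granger_causes(A_list, cause, effect, tol=1e-12):
    """True if some block A_i[effect, cause] is nonzero, i.e. if the
    noncausality restriction of Corollary 2.2.1 is violated.

    cause, effect : index lists that together cover all K variables.
    """
    return any(np.any(np.abs(A[np.ix_(effect, cause)]) > tol) for A in A_list)

# Investment/income/consumption system (2.1.14): indices 0, 1, 2
A1 = np.array([[0.5, 0.0, 0.0], [0.1, 0.1, 0.3], [0.0, 0.2, 0.3]])
x_causes_z = granger_causes([A1], cause=[1, 2], effect=[0])   # (income, consumption) -> investment
z_causes_x = granger_causes([A1], cause=[0], effect=[1, 2])   # investment -> (income, consumption)

# Inflation/interest rate system (2.1.15): indices 0, 1
B1 = np.array([[0.5, 0.1], [0.4, 0.5]])
B2 = np.array([[0.0, 0.0], [0.25, 0.0]])
feedback = (granger_causes([B1, B2], cause=[0], effect=[1])
            and granger_causes([B1, B2], cause=[1], effect=[0]))

# MA condition of Proposition 2.2 for the VAR(1): Phi_i = A_1^i, block Phi_{12,i}
phi12_zero = all(np.allclose(np.linalg.matrix_power(A1, i)[0, 1:], 0.0)
                 for i in range(1, 11))

print(x_causes_z, z_causes_x, feedback, phi12_zero)
```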


Characterization  of  Instantaneous  Causality 


In order to study the concept of instantaneous causality in the framework
of the MA process (2.3.5), it is useful to rewrite that representation. Note
that the positive definite symmetric matrix $\Sigma_u$ can be written as the product
$\Sigma_u = PP'$, where $P$ is a lower triangular nonsingular matrix with positive
diagonal elements (see Appendix A.9.3). Thus, (2.3.5) can be represented as

$$y_t = \mu + \sum_{i=0}^{\infty} \Phi_i P P^{-1} u_{t-i} = \mu + \sum_{i=0}^{\infty} \Theta_i w_{t-i}, \tag{2.3.15}$$

where $\Theta_i := \Phi_i P$ and $w_t := P^{-1}u_t$ is white noise with covariance matrix

$$\Sigma_w = P^{-1}\Sigma_u(P^{-1})' = I_K. \tag{2.3.16}$$

Because the white noise errors $w_t$ have uncorrelated components, they are
often called orthogonal residuals or innovations.

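Partitioning the representation (2.3.15) according to the partitioning of
$y_t = (z_t', x_t')'$ gives

$$\begin{bmatrix} z_t \\ x_t \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} \Theta_{11,0} & 0 \\ \Theta_{21,0} & \Theta_{22,0} \end{bmatrix} \begin{bmatrix} w_{1,t} \\ w_{2,t} \end{bmatrix} + \begin{bmatrix} \Theta_{11,1} & \Theta_{12,1} \\ \Theta_{21,1} & \Theta_{22,1} \end{bmatrix} \begin{bmatrix} w_{1,t-1} \\ w_{2,t-1} \end{bmatrix} + \cdots.$$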

Hence,

$$z_{t+1} = \mu_1 + \Theta_{11,0} w_{1,t+1} + \Theta_{11,1} w_{1,t} + \Theta_{12,1} w_{2,t} + \cdots$$

and

$$x_{t+1} = \mu_2 + \Theta_{21,0} w_{1,t+1} + \Theta_{22,0} w_{2,t+1} + \Theta_{21,1} w_{1,t} + \Theta_{22,1} w_{2,t} + \cdots.$$

The optimal 1-step predictor of $x_t$ based on $\{y_s \mid s \le t\}$ and, in addition, on
$z_{t+1}$, is equal to the 1-step predictor of $x_t$ based on $\{w_s \mid s \le t\} \cup \{w_{1,t+1}\}$,
that is,

$$\begin{aligned}
x_t(1 \mid \{y_s \mid s \le t\} \cup \{z_{t+1}\}) &= x_t(1 \mid \{w_s = (w_{1,s}', w_{2,s}')' \mid s \le t\} \cup \{w_{1,t+1}\}) \\
&= \Theta_{21,0} w_{1,t+1} + x_t(1 \mid \{y_s \mid s \le t\}).
\end{aligned} \tag{2.3.17}$$

Consequently,

$$x_t(1 \mid \{y_s \mid s \le t\} \cup \{z_{t+1}\}) = x_t(1 \mid \{y_s \mid s \le t\})$$

if and only if $\Theta_{21,0} = 0$. This condition, in turn, is easily seen to hold if and
only if the covariance matrix $\Sigma_u$ is block diagonal with a $((K - M) \times M)$
block of zeros in the lower left-hand corner and an $(M \times (K - M))$ block of
zeros in the upper right-hand corner. Of course, this means that $u_{1t}$ and $u_{2t}$
in (2.3.5) have to be uncorrelated, i.e., $E(u_{1t}u_{2t}') = 0$. Thereby the following
proposition is proven.

Proposition 2.3 (Characterization of Instantaneous Causality)

Let $y_t$ be as in (2.3.5)/(2.3.15) with nonsingular innovation covariance matrix
$\Sigma_u$. Then there is no instantaneous causality between $z_t$ and $x_t$ if and only if

\[
E(u_{1t} u_{2t}') = 0. \tag{2.3.18}
\]

This proposition provides a condition for instantaneous causality which is
easy to check if the process is given in MA or VAR form. For instance, for the
investment/income/consumption system with white noise covariance matrix
(2.1.33),

\[
\Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix},
\]

there is no instantaneous causality between (income, consumption) and
investment.

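From Propositions 2.2 and 2.3 it follows that $y_t = (z_t', x_t')'$ has a
representation with orthogonal innovations as in (2.3.15) of the form

\[
\begin{aligned}
\begin{bmatrix} z_t \\ x_t \end{bmatrix}
&= \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
+ \begin{bmatrix} \Theta_{11,0} & 0 \\ 0 & \Theta_{22,0} \end{bmatrix}
\begin{bmatrix} w_{1,t} \\ w_{2,t} \end{bmatrix}
+ \begin{bmatrix} \Theta_{11,1} & 0 \\ \Theta_{21,1} & \Theta_{22,1} \end{bmatrix}
\begin{bmatrix} w_{1,t-1} \\ w_{2,t-1} \end{bmatrix}
+ \cdots \\
&= \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
+ \begin{bmatrix} \Theta_{11}(L) & 0 \\ \Theta_{21}(L) & \Theta_{22}(L) \end{bmatrix}
\begin{bmatrix} w_{1,t} \\ w_{2,t} \end{bmatrix}
\end{aligned} \tag{2.3.19}
\]

if $x_t$ does not Granger-cause $z_t$ and, furthermore, there is no instantaneous
causation between $x_t$ and $z_t$. In the absence of instantaneous causality, a
similar representation with $\Theta_{21}(L) = 0$ is obtained if $z_t$ is not Granger-causal
for $x_t$.

Discussion of Instantaneous and Granger-Causality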

At  this  point,  some  words  of  caution  seem  appropriate.  The  term  “causality” 
suggests a cause and effect relationship between two sets of variables.
Proposition 2.3 shows that such an interpretation is problematic with respect to
instantaneous causality because this term only describes a nonzero
correlation between two sets of variables. It does not say anything about the cause
and  effect  relation.  The  direction  of  instantaneous  causation  cannot  be  derived 
from  the  MA  or  VAR  representation  of  the  process  but  must  be  obtained  from 
further  knowledge  on  the  relationship  between  the  variables.  Such  knowledge 
may  exist  in  the  form  of  an  economic  theory. 

Although a direction of causation has been defined in relation to
Granger-causality, it is problematic to interpret the absence of causality from
$x_t$ to $z_t$ in the sense that variations in $x_t$ will have no effect on $z_t$. To see this
consider, for instance, the stable bivariate VAR(1) system

\[
\begin{bmatrix} z_t \\ x_t \end{bmatrix}
= \begin{bmatrix} \alpha_{11} & 0 \\ \alpha_{21} & \alpha_{22} \end{bmatrix}
\begin{bmatrix} z_{t-1} \\ x_{t-1} \end{bmatrix}
+ \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}. \tag{2.3.20}
\]

In this system, $x_t$ does not Granger-cause $z_t$ by Corollary 2.2.1. However, the
system may be multiplied by some nonsingular matrix

\[
B = \begin{bmatrix} 1 & \beta \\ 0 & 1 \end{bmatrix}
\]

so that

\[
\begin{bmatrix} z_t \\ x_t \end{bmatrix}
= \begin{bmatrix} 0 & -\beta \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} z_t \\ x_t \end{bmatrix}
+ \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix}
\begin{bmatrix} z_{t-1} \\ x_{t-1} \end{bmatrix}
+ \begin{bmatrix} v_{1t} \\ v_{2t} \end{bmatrix}, \tag{2.3.21}
\]

where $\gamma_{11} := \alpha_{11} + \alpha_{21}\beta$, $\gamma_{12} := \alpha_{22}\beta$, $\gamma_{21} := \alpha_{21}$, $\gamma_{22} := \alpha_{22}$ and
$(v_{1t}, v_{2t})' := B(u_{1t}, u_{2t})'$. Note that this is just another representation of
the process $(z_t, x_t)'$ and not another process. (The reader may check that
the process (2.3.21) has the same means and autocovariances as the one in
(2.3.20).)

In other words, the stochastic interrelationships between the random
variables of the system can be characterized either by (2.3.20) or by (2.3.21),
although the two representations have quite different physical interpretations.
If (2.3.21) happens to represent what actually goes on in the system, changes in
$x_t$ may affect $z_t$ through the term with the coefficient $-\beta$ in the first equation.
Thus, the lack of a Granger-causal relationship from one group of variables to
the remaining variables cannot necessarily be interpreted as lack of a cause
and effect relationship. It must be remembered that a VAR or MA
representation characterizes the joint distribution of sets of random variables.
Deriving cause and effect relationships from it usually requires further
assumptions regarding the relationship between the variables involved. We will
return to this problem in the following subsections.
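
That (2.3.20) and (2.3.21) describe the same process can be illustrated with a small numerical check. The sketch below assumes hypothetical values for $\alpha_{11}$, $\alpha_{21}$, $\alpha_{22}$ and $\beta$ (they are not taken from the text) and verifies that solving (2.3.21) for the current-period vector leads back to the coefficient matrix of (2.3.20).

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical parameter values, chosen only for illustration.
a11, a21, a22, beta = 0.5, 0.4, 0.3, 2.0

A = [[a11, 0.0], [a21, a22]]        # coefficient matrix of (2.3.20)
B = [[1.0, beta], [0.0, 1.0]]       # nonsingular transformation matrix
B_inv = [[1.0, -beta], [0.0, 1.0]]  # inverse of B

# Lagged coefficients of (2.3.21): Gamma = B A.
Gamma = matmul(B, A)

# (2.3.21) reads y_t = (I - B) y_t + Gamma y_{t-1} + v_t, i.e. B y_t = Gamma y_{t-1} + v_t.
# Solving for y_t gives the reduced form coefficient B^{-1} Gamma.
A_back = matmul(B_inv, Gamma)

print("Gamma  =", Gamma)
print("A_back =", A_back)
```

In the reduced form recovered this way, the upper right-hand element is zero again, so the Granger-noncausality of $x_t$ for $z_t$ is a property of the process and not of the particular representation, while the coefficient $-\beta$ appears only in the structural form.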
Further problems related to the interpretation of Granger-causality result
from restricting the information set to contain only past and present variables
of the system rather than all information in the universe. The reduction of the
information set is of no consequence only if all other information in the
universe is irrelevant for the problem at hand. Some related problems will be
discussed in the following.

Changing  the  Information  Set 

So far it has been assumed that the information set contains only the variables
or groups of variables for which we want to analyze the causal links.
Often we are interested in the causal links between two variables in a higher
dimensional system. In other words, we are interested in analyzing
Granger-causality in a framework where the information set contains more than just the
variables of direct interest. In the bivariate framework when the information
set is limited to the two variables of interest, it was seen that if the 1-step
ahead forecasts of one variable cannot be improved by using the information
in the other variable, the same holds for all $h$-step forecasts, $h = 1, 2, \dots$.
This result does not hold anymore if the information set contains additional
variables, as pointed out by Lütkepohl (1993).

To be more explicit, suppose the vector time series $z_t$, $y_t$, $x_t$ with
dimensions $K_z$, $K_y$, $K_x$, respectively, are jointly generated by a VAR(p) process

\[
\begin{bmatrix} z_t \\ y_t \\ x_t \end{bmatrix}
= \sum_{i=1}^{p} A_i \begin{bmatrix} z_{t-i} \\ y_{t-i} \\ x_{t-i} \end{bmatrix} + u_t, \tag{2.3.22}
\]

where

\[
A_i = \begin{bmatrix}
A_{zz,i} & A_{zy,i} & A_{zx,i} \\
A_{yz,i} & A_{yy,i} & A_{yx,i} \\
A_{xz,i} & A_{xy,i} & A_{xx,i}
\end{bmatrix}, \quad i = 1, \dots, p,
\]

with $A_{kl,i}$ having dimension $(K_k \times K_l)$ and $u_t$ is zero mean white noise,
as usual. In this process, if $A_{zy,i} = 0$, $i = 1, 2, \dots, p$, it is not difficult to
see that the information in $y_t$ cannot be used to improve the 1-step ahead
forecasts of $z_t$ but it is still possible that it can be used to improve the $h$-step
forecasts for $h = 2, 3, \dots$. In other words, if $y_t$ is 1-step noncausal for $z_t$, it may
still be $h$-step causal for $h > 1$. Consequently, it makes sense to define more
refined concepts of causality which refer explicitly to the forecast horizon. For
instance, $y_t$ may be called $h$-step noncausal for $z_t$ ($y_t \nrightarrow_{(h)} z_t$) for $h = 1, 2, \dots$,
if the $j$-step ahead forecasts of $z_t$ cannot be improved for $j \le h$ by taking into
account the information in past and present $y_t$. Now the original concept of
Granger-causality corresponds to infinite-step causality.

The corresponding restrictions of multi-step causality on the VAR
coefficients have been considered by Dufour & Renault (1998). Unlike in the
bivariate setting explored earlier, now nonlinear restrictions on the VAR coefficients
are obtained which make it more difficult to check for $h$-step causality if the
information set is expanded by additional variables.

To state the restrictions formally, let $\mathbf{A}$ be defined as in the VAR(1)
representation (2.1.8), let $J = [I_K : 0 : \cdots : 0]$ be a $(K \times Kp)$ matrix as before
and define $A^{(j)} = J\mathbf{A}^j$ and $\alpha^{(j)} = \mathrm{vec}(A^{(j)})$. Dufour & Renault (1998) show
that in the process (2.3.22), $y_t \nrightarrow_{(h)} z_t$ if and only if

\[
R\alpha^{(j)} = 0 \quad \text{for } j = 1, \dots, h, \tag{2.3.23}
\]

and $y_t \nrightarrow_{(\infty)} z_t$ if and only if

\[
R\alpha^{(j)} = 0 \quad \text{for } j = 1, \dots, pK_x + 1. \tag{2.3.24}
\]

Here the restriction matrix $R$ is such that $R\,\mathrm{vec}[A_1, \dots, A_p] = \mathrm{vec}[A_{zy,1}, \dots,
A_{zy,p}]$, that is, it collects the elements of the second block in the first row of
each of the coefficient matrices.

As an example consider again the 3-dimensional VAR(1) process (2.1.14).
For infinite-step causality or noncausality from $y_{2t}$ to $y_{1t}$ we need to check
the relevant elements of the coefficient matrix and its second power:

\[
A_1 = \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix},
\qquad
A_1^2 = \begin{bmatrix} .25 & 0 & 0 \\ .06 & .07 & .12 \\ .02 & .08 & .15 \end{bmatrix}.
\]

Clearly, $y_{2t} \nrightarrow_{(1)} y_{1t}$ holds because $\alpha_{12,1} = 0$ and also the restrictions for
$y_{2t} \nrightarrow_{(\infty)} y_{1t}$ are satisfied in this case because the (1,2)-th element of $A_1^2$ is
also zero. In contrast, $y_{1t} \nrightarrow_{(1)} y_{3t}$ holds, while $y_{1t} \nrightarrow_{(\infty)} y_{3t}$ does not, because
the lower left-hand element of $A_1^2$ is nonzero. Notice that the definition and
characterizations of multi-step causality are given for the first two sets of
subvectors with the third one containing the extra variables. For applying
the definition and results in the present example, the variables may just be
rearranged accordingly.
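
The checks in this example can be reproduced with a few lines of code. The sketch below is restricted to the special case used here, a VAR(1) with scalar variables, where the conditions (2.3.23) and (2.3.24) reduce to zero restrictions on a single element of the powers of $A_1$. The function name is_noncausal and the zero tolerance are assumptions made for illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Coefficient matrix of the example VAR(1) process (2.1.14).
A1 = [[0.5, 0.0, 0.0],
      [0.1, 0.1, 0.3],
      [0.0, 0.2, 0.3]]

def is_noncausal(a, cause, effect, horizon, tol=1e-12):
    """Check that element (effect, cause) of a^j is zero for j = 1, ..., horizon.

    Indices are 0-based. Valid only for a VAR(1) with scalar variables.
    """
    power = a
    for _ in range(horizon):
        if abs(power[effect][cause]) > tol:
            return False
        power = matmul(power, a)
    return True

# With p = 1 and one extra variable (K_x = 1), (2.3.24) requires j = 1, ..., p*K_x + 1 = 2.
p, k_extra = 1, 1
infinite_horizon = p * k_extra + 1

y2_to_y1_one_step = is_noncausal(A1, cause=1, effect=0, horizon=1)
y2_to_y1_infinite = is_noncausal(A1, cause=1, effect=0, horizon=infinite_horizon)
y1_to_y3_one_step = is_noncausal(A1, cause=0, effect=2, horizon=1)
y1_to_y3_infinite = is_noncausal(A1, cause=0, effect=2, horizon=infinite_horizon)

print(y2_to_y1_one_step, y2_to_y1_infinite, y1_to_y3_one_step, y1_to_y3_infinite)
```

For subvectors of dimension greater than one or for $p > 1$, whole blocks of $J\mathbf{A}^j$ would have to be inspected instead of single elements.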

In addition to these extensions related to increasing the information set,
there are also other problems which may make it difficult to interpret
Granger-causal relations even in a bivariate setting. Let us discuss some of them in
terms of an inflation/interest rate system. For example, it may make a
difference whether the information set contains monthly, quarterly or annual data.
If a quarterly system is considered and no causality is found from the
interest rate to inflation it does not follow that a corresponding monthly interest
rate has no impact on the monthly inflation rate. In other words, the interest
rate may Granger-cause inflation in a monthly system even if it does not in a
quarterly system.

Furthermore, putting seasonally adjusted variables in the information set is
not the same as using unadjusted variables. Consequently, if Granger-causality
is found for the seasonally adjusted variables, it is still possible that in the
actual seasonal system the interest rate is not Granger-causal for inflation.
Similar comments apply in the presence of measurement errors. Finally, causality
analyses are usually based on estimated rather than known systems.
Additional problems result in that case. We will return to them in the next chapter.

The  previous  critical  remarks  are  meant  to  caution  the  reader  and  multiple 
time  series  analyst  against  overinterpreting  the  evidence  from  a  VAR  model. 
Still,  causality  analyses  are  useful  tools  in  practice  if  these  critical  points  are 
kept  in  mind.  At  the  very  least,  a  Granger-causality  analysis  tells  the  analyst 
whether a set of variables contains useful information for improving the
predictions of another set of variables. Further discussions of causality issues and
many  further  references  may  be  found  in  Geweke  (1984)  and  Granger  (1982). 

2.3.2  Impulse  Response  Analysis 

In  the  previous  subsection,  we  have  seen  that  Granger-causality  may  not  tell 
us  the  complete  story  about  the  interactions  between  the  variables  of  a  system. 
In  applied  work,  it  is  often  of  interest  to  know  the  response  of  one  variable  to 
an  impulse  in  another  variable  in  a  system  that  involves  a  number  of  further 
variables  as  well.  Thus,  one  would  like  to  investigate  the  impulse  response 
relationship  between  two  variables  in  a  higher  dimensional  system.  Of  course, 
if  there  is  a  reaction  of  one  variable  to  an  impulse  in  another  variable  we  may 
call  the  latter  causal  for  the  former.  In  this  subsection,  we  will  study  this  type 
of  causality  by  tracing  out  the  effect  of  an  exogenous  shock  or  innovation  in 
one  of  the  variables  on  some  or  all  of  the  other  variables.  This  kind  of  impulse 
response  analysis  is  sometimes  called  multiplier  analysis.  For  instance,  in  a 
system  consisting  of  an  inflation  rate  and  an  interest  rate,  the  effect  of  an 
increase  in  the  inflation  rate  may  be  of  interest.  In  the  real  world,  such  an 
increase  may  be  induced  exogenously  from  outside  the  system  by  events  like 
the increase of the oil price in 1973/74 when OPEC agreed on a joint
action  to  raise  prices.  Alternatively,  an  increase  or  reduction  in  the  interest 
rate  may  be  administered  by  the  central  bank  for  reasons  outside  the  simple 
two  variable  system  under  study. 

Responses  to  Forecast  Errors 

Suppose the effect of an innovation in investment in a system containing
investment ($y_1$), income ($y_2$), and consumption ($y_3$) is of interest. To isolate
such an effect, suppose that all three variables assume their mean value prior
to time $t = 0$, $y_t = \mu$ for $t < 0$, and investment increases by one unit in period
$t = 0$, that is, $u_{1,0} = 1$. Now we can trace out what happens to the system
during periods $t = 1, 2, \dots$ if no further shocks occur, that is, $u_{2,0} = u_{3,0} = 0$,
$u_1 = 0$, $u_2 = 0, \dots$. Because we are not interested in the mean of the system
in such an exercise but just in the variations of the variables around their
means, we assume that all three variables have mean zero and set $\nu = 0$ in
(2.1.14). Hence, $y_t = A_1 y_{t-1} + u_t$ or, more precisely,

\[
\begin{bmatrix} y_{1,t} \\ y_{2,t} \\ y_{3,t} \end{bmatrix}
= \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix}
\begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \\ y_{3,t-1} \end{bmatrix}
+ \begin{bmatrix} u_{1,t} \\ u_{2,t} \\ u_{3,t} \end{bmatrix}. \tag{2.3.25}
\]

Tracing a unit shock in the first variable in period $t = 0$ in this system we get

\[
y_0 = \begin{bmatrix} y_{1,0} \\ y_{2,0} \\ y_{3,0} \end{bmatrix}
= \begin{bmatrix} u_{1,0} \\ u_{2,0} \\ u_{3,0} \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},
\qquad
y_1 = \begin{bmatrix} y_{1,1} \\ y_{2,1} \\ y_{3,1} \end{bmatrix}
= A_1 y_0 = \begin{bmatrix} .5 \\ .1 \\ 0 \end{bmatrix},
\]

\[
y_2 = \begin{bmatrix} y_{1,2} \\ y_{2,2} \\ y_{3,2} \end{bmatrix}
= A_1 y_1 = A_1^2 y_0 = \begin{bmatrix} .25 \\ .06 \\ .02 \end{bmatrix}.
\]

Continuing the procedure, it turns out that $y_i = (y_{1,i}, y_{2,i}, y_{3,i})'$ is just the
first column of $A_1^i$. An analogous line of arguments shows that a unit shock
in $y_{2t}$ ($y_{3t}$) at $t = 0$, after $i$ periods, results in a vector $y_i$ which is just the
second (third) column of $A_1^i$. Thus, the elements of $A_1^i$ represent the effects of
unit shocks in the variables of the system after $i$ periods. Therefore they are
called impulse responses or dynamic multipliers.
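
The tracing exercise can be written out as a short recursion. The following sketch uses the coefficient matrix of (2.3.25); the function name trace_shock is introduced here for illustration only.

```python
def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

# Coefficient matrix A_1 of the example system (2.3.25).
A1 = [[0.5, 0.0, 0.0],
      [0.1, 0.1, 0.3],
      [0.0, 0.2, 0.3]]

def trace_shock(a, k, periods):
    """Return y_0, y_1, ..., y_periods after a unit shock in variable k (0-based)
    at time 0, assuming y_t = 0 for t < 0 and no further shocks."""
    y = [1.0 if i == k else 0.0 for i in range(len(a))]
    path = [y]
    for _ in range(periods):
        y = matvec(a, y)  # y_i = A_1 y_{i-1}
        path.append(y)
    return path

path = trace_shock(A1, k=0, periods=2)
for i, y in enumerate(path):
    print(i, [round(x, 4) for x in y])
```

Calling trace_shock with k=1 or k=2 gives the responses to income and consumption shocks, that is, the second and third columns of the powers of $A_1$.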

Recall that $A_1^i = \Phi_i$ is just the $i$-th coefficient matrix of the MA
representation of a VAR(1) process. Consequently, the MA coefficient matrices
contain the impulse responses of the system. This result holds more
generally for higher order VAR(p) processes as well. To see this, suppose that $y_t$
is a stationary VAR(p) process as in (2.1.1) with $\nu = 0$. This process has a
corresponding VAR(1) process $Y_t = \mathbf{A}Y_{t-1} + U_t$ as in (2.1.8) with $\boldsymbol{\nu} = 0$.
Under the assumptions of the previous example, $y_t = 0$ for $t < 0$, $u_t = 0$
for $t > 0$ and $y_0 = u_0$ is a $K$-dimensional unit vector $e_k$, say, with a one as
the $k$-th coordinate and zeros elsewhere. It follows that $Y_0 = (e_k', 0, \dots, 0)'$
and $Y_i = \mathbf{A}^i Y_0$. Hence, the impulse responses are the elements of the upper
left-hand $(K \times K)$ block of $\mathbf{A}^i$. This matrix, however, was shown to be the $i$-th
coefficient matrix $\Phi_i$ of the MA representation (2.1.17) of $y_t$, i.e., $\Phi_i = J\mathbf{A}^i J'$
with $J := [I_K : 0 : \cdots : 0]$ a $(K \times Kp)$ matrix. In other words, $\phi_{jk,i}$, the
$jk$-th element of $\Phi_i$, represents the reaction of the $j$-th variable of the system
to a unit shock in variable $k$, $i$ periods ago, provided, of course, the effect is
not contaminated by other shocks to the system. Because the $u_t$ are just the
1-step ahead forecast errors of the VAR process, the shocks considered here
may be regarded as forecast errors and the impulse responses are sometimes
referred to as forecast error impulse responses.
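
The relation $\Phi_i = J\mathbf{A}^i J'$ translates directly into code. The sketch below builds the companion matrix of a VAR(p) and reads off the upper left-hand $(K \times K)$ block of its powers. The bivariate VAR(2) coefficients are hypothetical values chosen for illustration, and the function names are assumptions rather than anything defined in the text.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def companion(coefs):
    """Stack A_1, ..., A_p into the (Kp x Kp) companion matrix of the VAR(1) form."""
    p, K = len(coefs), len(coefs[0])
    n = K * p
    big = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(coefs):          # first block row: [A_1 : ... : A_p]
        for r in range(K):
            for c in range(K):
                big[r][i * K + c] = a[r][c]
    for r in range(K, n):                  # identity blocks below the first block row
        big[r][r - K] = 1.0
    return big

def ma_coefficients(coefs, n_max):
    """Return Phi_0, ..., Phi_{n_max} as upper left (K x K) blocks of companion powers."""
    K = len(coefs[0])
    big = companion(coefs)
    power = identity(len(big))
    phis = []
    for _ in range(n_max + 1):
        phis.append([row[:K] for row in power[:K]])
        power = matmul(power, big)
    return phis

# Hypothetical bivariate VAR(2) coefficients, for illustration only.
A1 = [[0.5, 0.1], [0.4, 0.5]]
A2 = [[0.0, 0.0], [0.25, 0.0]]

phis = ma_coefficients([A1, A2], n_max=3)
for i, phi in enumerate(phis):
    print(i, [[round(x, 4) for x in row] for row in phi])
```

As a plausibility check, the second coefficient matrix obtained this way should equal $A_1^2 + A_2$, which is what the usual recursion for the MA coefficients gives.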

The response of variable $j$ to a unit shock (forecast error) in variable
$k$ is sometimes depicted graphically to get a visual impression of the
dynamic interrelationships within the system. Impulse responses of the
investment/income/consumption system are plotted in Figure 2.5 and the dynamic
responses of the inflation/interest rate system are depicted in Figure 2.6. For
instance, in the latter figure an inflation innovation is seen to induce the
interest rate to increase for two periods and then it tapers off to zero. In both
systems the effect of a unit shock in any of the variables dies away quite
rapidly due to the stability of the systems.

Fig. 2.5. Impulse responses of the investment/income/consumption system
(impulse → response).

If the variables have different scales, it is sometimes useful to consider
innovations of one standard deviation rather than unit shocks. For instance,
instead of tracing an unexpected unit increase in investment in the
investment/income/consumption system with white noise covariance matrix
(2.1.33), one may follow up on a shock of $\sqrt{2.25} = 1.5$ units because the
standard deviation of $u_{1t}$ is 1.5. Of course, this is just a matter of rescaling
the impulse responses. In Figures 2.5 and 2.6, it suffices to choose the units
at the vertical axes equal to the standard deviations of the residuals
corresponding to the variables whose effects are considered. Such a rescaling may
sometimes give a better picture of the dynamic relationships because the
average size of the innovations occurring in a system depends on their standard
deviation.

Fig. 2.6. Impulse responses of the inflation/interest rate system
(impulse → response).

It follows from Proposition 2.2 that the impulse responses are zero if one
of the variables does not Granger-cause the other variables taken as a group.
More precisely, an innovation in variable $k$ has no effect on the other variables
if the former variable does not Granger-cause the set of the remaining
variables. As we have mentioned previously, in applied work it is often of foremost
interest whether one variable has an impact on a specific other variable. That
is, one would like to know whether, for some $k \ne j$, $\phi_{jk,i} = 0$ for $i = 1, 2, \dots$.
If the $\phi_{jk,i}$ represent the actual reactions of variable $j$ to a unit shock in
variable $k$, we may call the latter noncausal for the $j$-th variable if $\phi_{jk,i} = 0$
for $i = 1, 2, \dots$. In order to check the latter condition, it is not necessary to
compute infinitely many $\Phi_i$ matrices. The following proposition shows that it
suffices to check the first $p(K - 1)$ $\Phi_i$ matrices.

Proposition 2.4 (Zero Impulse Responses)

If $y_t$ is a $K$-dimensional stable VAR(p) process, then, for $j \ne k$,

\[
\phi_{jk,i} = 0 \quad \text{for } i = 1, 2, \dots
\]

is equivalent to

\[
\phi_{jk,i} = 0 \quad \text{for } i = 1, \dots, p(K - 1).
\]


In other words, the proposition asserts that for a $K$-dimensional,
stationary, stable VAR(p), if the first $pK - p$ responses of variable $j$ to an impulse
in variable $k$ are zero, all the following responses must also be zero. For
instance, in the investment/income/consumption VAR(1) system, because the
responses of investment for the next two periods after a consumption impulse
are zero, we know that investment will not react at all to such an impulse.
Note that in a VAR(1) system of dimension greater than 2, it does not suffice
to check, say, the upper right-hand corner element of the coefficient matrix
in order to determine whether the last variable is noncausal for the first
variable. Notice that Proposition 2.4 is related to the conditions for multi-step
causality in (2.3.23) and (2.3.24). In general, the conditions are not identical,
however, because the two concepts differ. Proposition 2.4 will be helpful when
testing of impulse response relations is discussed in the next chapter. We will
now prove the proposition.

Proof of Proposition 2.4:

Returning to the lag operator notation of Section 2.1.2, we have

\[
\Phi(L) = (\phi_{jk}(L))_{j,k} = A(L)^{-1} = A(L)^{adj} / \det(A(L)),
\]

where $A(L)^{adj} = (A_{jk}(L))_{j,k}$ is the adjoint of $A(L) = I_K - A_1 L - \cdots - A_p L^p$
(see Appendix A.4.1). Obviously, $\phi_{jk}(L) = 0$ is equivalent to $A_{jk}(L) = 0$.
From the definition of a cofactor of a matrix in Appendix A.3, it is easy to see
that $A_{jk}(L)$ has degree not greater than $pK - p$. Defining $\gamma(L) = [\det A(L)]^{-1}$,
we get for $k \ne j$,

\[
\begin{aligned}
\phi_{jk}(L) &= \phi_{jk,1} L + \phi_{jk,2} L^2 + \cdots \\
&= A_{jk}(L)\gamma(L) \\
&= (A_{jk,1} L + \cdots + A_{jk,pK-p} L^{pK-p})(1 + \gamma_1 L + \cdots).
\end{aligned}
\]

Hence,

\[
\phi_{jk,1} = A_{jk,1} \quad \text{and} \quad
\phi_{jk,i} = A_{jk,i} + \sum_{n=1}^{i-1} A_{jk,n}\gamma_{i-n} \quad \text{for } i > 1,
\]

with $A_{jk,n} = 0$ for $n > pK - p$. Consequently, $A_{jk,i} = 0$ for $i = 1, \dots, pK - p$, is
equivalent to $\phi_{jk,i} = 0$ for $i = 1, 2, \dots, pK - p$, which proves the proposition. ■

Sometimes interest centers on the accumulated effect over several or
more periods of a shock in one variable. This effect may be determined
by summing up the MA coefficient matrices. For instance, the $k$-th column
of $\Psi_n := \sum_{i=0}^{n} \Phi_i$ contains the accumulated responses over $n$ periods to a
unit shock in the $k$-th variable of the system. These quantities are
sometimes called $n$-th interim multipliers. The total accumulated effects for all
future periods are obtained by summing up all the MA coefficient matrices.
$\Psi_\infty := \sum_{i=0}^{\infty} \Phi_i$ is sometimes called the matrix of long-run effects or total
multipliers. Because the MA operator $\Phi(z)$ is the inverse of the VAR operator
$A(z) = I_K - A_1 z - \cdots - A_p z^p$, the long-run effects are easily obtained as

\[
\Psi_\infty = \Phi(1) = (I_K - A_1 - \cdots - A_p)^{-1}. \tag{2.3.26}
\]

As an example, accumulated responses for the investment/income/consumption
system are depicted in Figure 2.7. Similarly, interim and total multipliers
of the inflation/interest rate system are shown in Figure 2.8.
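
The relation (2.3.26) can be illustrated for the example VAR(1) system (2.3.25) without inverting a matrix: if the accumulated sum of the $\Phi_i = A_1^i$ converges to $(I_K - A_1)^{-1}$, then premultiplying it by $I_K - A_1$ must give approximately the identity. The sketch below assumes a truncation at 200 terms, which is an arbitrary choice that is more than sufficient for this stable system.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))] for i in range(len(a))]

# Coefficient matrix A_1 of the example system (2.3.25).
A1 = [[0.5, 0.0, 0.0],
      [0.1, 0.1, 0.3],
      [0.0, 0.2, 0.3]]
K = len(A1)

phi = identity(K)   # Phi_0 = I_K
psi = identity(K)   # Psi_0 = Phi_0
for _ in range(200):            # accumulate Psi_n = Phi_0 + ... + Phi_n
    phi = matmul(phi, A1)       # Phi_i = A_1^i for a VAR(1)
    psi = matadd(psi, phi)

# (I_K - A_1) Psi_n equals I_K - A_1^{n+1}, which approaches I_K for a stable process.
I_minus_A = [[(1.0 if i == j else 0.0) - A1[i][j] for j in range(K)] for i in range(K)]
check = matmul(I_minus_A, psi)

for row in psi:
    print([round(x, 4) for x in row])
```

The columns of the printed matrix approximate the total multipliers: the $k$-th column gives the long-run accumulated responses of all variables to a unit shock in variable $k$.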
Fig. 2.7. Accumulated and long-run responses of the investment/income/consumption
system (impulse → response).

Fig. 2.8. Accumulated and total responses of the inflation/interest rate system
(impulse → response).


Responses to Orthogonal Impulses

A problematic assumption in this type of impulse response analysis is that
a shock occurs only in one variable at a time. Such an assumption may be
reasonable if the shocks in different variables are independent. If they are not
independent one may argue that the error terms consist of all the influences
and variables that are not directly included in the set of $y$ variables. Thus, in
addition to forces that affect all the variables, there may be forces that affect
variable 1, say, only. If a shock in the first variable is due to such forces it
may again be reasonable to interpret the $\Phi_i$ coefficients as dynamic responses.
On the other hand, correlation of the error terms may indicate that a shock
in one variable is likely to be accompanied by a shock in another variable. In
that case, setting all other residuals to zero may provide a misleading picture
of the actual dynamic relationships between the variables. For example, in
the investment/income/consumption system, the white noise or innovation
covariance matrix is given in (2.1.33),

\[
\Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix}.
\]

Obviously, there is a quite strong positive correlation between $u_{2,t}$ and $u_{3,t}$,
the residuals of the income and consumption equations, respectively.
Consequently, a shock in income may be accompanied by a shock in consumption in
the same period. Therefore, forcing the consumption innovation to zero when
the effect of an income shock is traced, as in the previous analysis, may in
fact obscure the actual relation between the variables.

This is the reason why impulse response analysis is often performed in
terms of the MA representation (2.3.15),

\[
y_t = \sum_{i=0}^{\infty} \Theta_i w_{t-i}, \tag{2.3.27}
\]

where the components of $w_t = (w_{1t}, \dots, w_{Kt})'$ are uncorrelated and have unit
variance, $\Sigma_w = I_K$. The mean term is dropped again because it is of no
interest in the present analysis. Recall that the representation (2.3.27) is obtained
by decomposing $\Sigma_u$ as $\Sigma_u = PP'$, where $P$ is a lower triangular matrix, and
defining $\Theta_i = \Phi_i P$ and $w_t = P^{-1}u_t$. In (2.3.27) it is reasonable to assume that
a change in one component of $w_t$ has no effect on the other components
because the components are orthogonal (uncorrelated). Moreover, the variances
of the components are one. Thus, a unit innovation is just an innovation of size
one standard deviation. The elements of the $\Theta_i$ are interpreted as responses
of the system to such innovations. More precisely, the $jk$-th element of $\Theta_i$ is
assumed to represent the effect on variable $j$ of a unit innovation in the $k$-th
variable that has occurred $i$ periods ago.

To relate these impulse responses to a VAR model, we consider the zero
mean VAR(p) process

\[
y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t. \tag{2.3.28}
\]

This process can be rewritten in such a way that the residuals of different
equations are uncorrelated. For this purpose, we choose a decomposition of the
white noise covariance matrix $\Sigma_u = W\Sigma_\varepsilon W'$, where $\Sigma_\varepsilon$ is a diagonal matrix
with positive diagonal elements and $W$ is a lower triangular matrix with unit
diagonal. This decomposition is obtained from the Choleski decomposition
$\Sigma_u = PP'$ by defining a diagonal matrix $D$ which has the same main diagonal
as $P$ and by specifying $W = PD^{-1}$ and $\Sigma_\varepsilon = DD'$.
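
The construction of $W$ and of the diagonal covariance matrix from the Choleski factor can be spelled out for the example covariance matrix (2.1.33). This is a minimal sketch; the helper names and the variable name sigma_eps for the diagonal covariance matrix are assumptions made for illustration.

```python
import math

def cholesky(s):
    """Lower triangular P with positive diagonal such that P P' = s."""
    n = len(s)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(p[i][k] * p[j][k] for k in range(j))
            p[i][j] = math.sqrt(acc) if i == j else acc / p[j][j]
    return p

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Innovation covariance matrix of the example system, see (2.1.33).
sigma_u = [[2.25, 0.0, 0.0],
           [0.0, 1.0, 0.5],
           [0.0, 0.5, 0.74]]
K = len(sigma_u)

P = cholesky(sigma_u)
d = [P[i][i] for i in range(K)]                                 # main diagonal of P
W = [[P[i][j] / d[j] for j in range(K)] for i in range(K)]      # W = P D^{-1}
sigma_eps = [[d[i] ** 2 if i == j else 0.0 for j in range(K)]   # D D'
             for i in range(K)]

# Reassemble W * sigma_eps * W' to confirm that it reproduces sigma_u.
rebuilt = matmul(matmul(W, sigma_eps), transpose(W))

print("W         =", [[round(x, 4) for x in row] for row in W])
print("sigma_eps =", [[round(x, 4) for x in row] for row in sigma_eps])
```

Dividing each column of $P$ by its diagonal element is what gives $W$ its unit diagonal, and the squared diagonal elements of $P$ become the variances of the orthogonalized residuals.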

Premultiplying (2.3.28) by $\mathsf{A} := W^{-1}$ gives

\[
\mathsf{A} y_t = A_1^* y_{t-1} + \cdots + A_p^* y_{t-p} + \varepsilon_t, \tag{2.3.29}
\]

where $A_i^* := \mathsf{A} A_i$, $i = 1, \dots, p$, and $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{Kt})' := \mathsf{A} u_t$ has diagonal
covariance matrix,

\[
\Sigma_\varepsilon = E(\varepsilon_t \varepsilon_t') = \mathsf{A} E(u_t u_t') \mathsf{A}' = \mathsf{A} \Sigma_u \mathsf{A}'.
\]

Adding $(I_K - \mathsf{A}) y_t$ to both sides of (2.3.29) gives

\[
y_t = A_0^* y_t + A_1^* y_{t-1} + \cdots + A_p^* y_{t-p} + \varepsilon_t, \tag{2.3.30}
\]

where $A_0^* := I_K - \mathsf{A}$. Because $W$ is lower triangular with unit diagonal, the
same is true for $\mathsf{A}$. Hence,

\[
A_0^* = I_K - \mathsf{A} = \begin{bmatrix}
0 & 0 & \cdots & 0 & 0 \\
\beta_{21} & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\beta_{K1} & \beta_{K2} & \cdots & \beta_{K,K-1} & 0
\end{bmatrix}
\]

is a lower triangular matrix with zero diagonal and, thus, in the representation
(2.3.30) of our VAR(p) process, the first equation contains no instantaneous
$y$'s on the right-hand side. The second equation may contain $y_{1t}$ and
otherwise lagged $y$'s on the right-hand side. More generally, the $k$-th equation may
contain $y_{1t}, \dots, y_{k-1,t}$ and not $y_{kt}, \dots, y_{Kt}$ on the right-hand side. Thus, if
(2.3.30) reflects the actual ongoings in the system, $y_{st}$ cannot have an
instantaneous impact on $y_{kt}$ for $k < s$. In the econometrics literature such a system
is called a recursive model (see Theil (1971, Section 9.6)). Herman Wold has
advocated these models where the researcher has to specify the instantaneous
“causal” ordering of the variables. This type of causality is therefore sometimes
referred to as Wold-causality. If we trace $\varepsilon_{it}$ innovations of size one standard
error through the system (2.3.30), we just get the $\Theta$ impulse responses. This
can be seen by solving the system (2.3.30) for $y_t$,

\[
y_t = (I_K - A_0^*)^{-1} A_1^* y_{t-1} + \cdots + (I_K - A_0^*)^{-1} A_p^* y_{t-p} + (I_K - A_0^*)^{-1} \varepsilon_t.
\]

Noting that $(I_K - A_0^*)^{-1} = W = PD^{-1}$ shows that the instantaneous effects
of one-standard deviation shocks ($\varepsilon_{it}$'s of size one standard deviation) to the
system are represented by the elements of $WD = P = \Theta_0$ because the diagonal
elements of $D$ are just standard deviations of the components of $\varepsilon_t$. The $\Theta_i$
may then be obtained by tracing these effects through the system.

The $\Theta_i$'s may provide response functions that are quite different from the
$\Phi_i$ responses. For the example VAR(1) system (2.3.25) with $\Sigma_u$ as in (2.1.33)
we get

\[
\begin{aligned}
\Theta_0 &= P = \begin{bmatrix} 1.5 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & .5 & .7 \end{bmatrix}, \\
\Theta_1 &= \Phi_1 P = \begin{bmatrix} .75 & 0 & 0 \\ .15 & .25 & .21 \\ 0 & .35 & .21 \end{bmatrix}, \\
\Theta_2 &= \Phi_2 P = \begin{bmatrix} .375 & 0 & 0 \\ .090 & .130 & .084 \\ .030 & .155 & .105 \end{bmatrix},
\end{aligned} \tag{2.3.31}
\]

and so on. Some more innovation responses are depicted in Figure 2.9.
Although they are similar to those given in Figure 2.5, there is an obvious
difference in the response of consumption to an income innovation. While
consumption responds with a time lag of one period in Figure 2.5, there is an
instantaneous effect in Figure 2.9.
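
The orthogonalized responses $\Theta_i = \Phi_i P$ of the example system can be computed along the following lines. The sketch assumes the coefficient matrix of (2.3.25) and the covariance matrix (2.1.33); the function names are hypothetical and introduced here only for illustration.

```python
import math

def cholesky(s):
    """Lower triangular P with positive diagonal such that P P' = s."""
    n = len(s)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(p[i][k] * p[j][k] for k in range(j))
            p[i][j] = math.sqrt(acc) if i == j else acc / p[j][j]
    return p

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Coefficient matrix of (2.3.25) and innovation covariance matrix (2.1.33).
A1 = [[0.5, 0.0, 0.0],
      [0.1, 0.1, 0.3],
      [0.0, 0.2, 0.3]]
sigma_u = [[2.25, 0.0, 0.0],
           [0.0, 1.0, 0.5],
           [0.0, 0.5, 0.74]]

def orthogonalized_responses(a, sigma, n_max):
    """Return Theta_0, ..., Theta_{n_max} for a VAR(1), using Theta_i = A_1^i P."""
    theta = cholesky(sigma)      # Theta_0 = P
    out = [theta]
    for _ in range(n_max):
        theta = matmul(a, theta)  # Theta_i = A_1 Theta_{i-1}
        out.append(theta)
    return out

Theta = orthogonalized_responses(A1, sigma_u, n_max=2)
for i, th in enumerate(Theta):
    print(i, [[round(x, 3) for x in row] for row in th])
```

Because $P$ depends on the ordering of the variables, permuting the rows and columns of sigma_u and A1 consistently before calling the function will in general produce different orthogonalized responses.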
Fig. 2.9. Orthogonalized impulse responses of the investment/income/consumption
system (impulse → response).


Note that $\Theta_0 = P$ is lower triangular and some elements below the diagonal
will be nonzero if $\Sigma_u$ has nonzero off-diagonal elements. For instance, for the
investment/income/consumption example $\Theta_0$ indicates that an income ($y_2$)
innovation has an immediate impact on consumption ($y_3$). If the white noise
covariance matrix $\Sigma_u$ contains zeros, some components of $u_t = (u_{1t}, \dots, u_{Kt})'$
are contemporaneously uncorrelated. Suppose, for instance, that $u_{1t}$ is
uncorrelated with $u_{it}$ for $i = 2, \dots, K$. In this case, $\mathsf{A} = W^{-1}$ and, thus, $\mathsf{A}_0^*$ has a
block of zeros so that $y_1$ has no instantaneous effect on $y_i$, $i = 2, \dots, K$. In the
example, investment has no instantaneous impact on income and consumption
because
\[
\Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix}
\]
and, hence, $u_{1t}$ is uncorrelated with $u_{2t}$ and $u_{3t}$. This, of course, is reflected in
the matrix of instantaneous effects $\Theta_0$ given in (2.3.31). Because the elements
of $P = \Theta_0$ represent the immediate responses of the system to unit innovations
they are sometimes called impact multipliers.

In order to determine whether there is no response at all of one variable to
an impulse in one of the other variables, it suffices to consider the first $pK - p$
response coefficients and the immediate effect. This result is stated formally
in the next proposition where $\theta_{jk,i}$ denotes the $jk$-th element of $\Theta_i$.

Proposition 2.5 (Zero Orthogonalized Impulse Responses)

If $y_t$ is a $K$-dimensional stable VAR($p$) process, then, for $j \neq k$,
\[
\theta_{jk,i} = 0 \quad \text{for } i = 0, 1, 2, \dots
\]
is equivalent to
\[
\theta_{jk,i} = 0 \quad \text{for } i = 0, 1, \dots, p(K - 1).
\]


The  proof  of  this  result  is  analogous  to  that  of  Proposition  2.4  and  is  left 
as  an  exercise  (see  Problem  2.2). 
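
Proposition 2.5 turns an infinite check into a finite one, which can be illustrated with a short sketch. It assumes NumPy and uses the investment/income/consumption VAR(1) example; the matrix `A1` below is the coefficient matrix implied by $\Theta_1 = A_1 P$ in (2.3.31) rather than a value quoted from this passage, and the helper names are hypothetical. Indices are 0-based in the code.

```python
import numpy as np

# Example VAR(1): A1 is the coefficient matrix implied by Theta_1 = A1 P
# in (2.3.31) (an assumption of this sketch); P is Theta_0 from (2.3.31).
A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
P = np.array([[1.5, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 0.7]])


def theta_path(A1, P, j, k, horizon):
    """Return theta_{jk,i}, i = 0, ..., horizon, for a VAR(1) (0-based j, k)."""
    values, phi = [], np.eye(A1.shape[0])
    for _ in range(horizon + 1):
        values.append((phi @ P)[j, k])
        phi = phi @ A1
    return np.array(values)


def never_responds(A1, P, j, k, p=1, tol=1e-12):
    """Proposition 2.5: only i = 0, ..., p(K-1) have to be inspected."""
    K = A1.shape[0]
    return bool(np.all(np.abs(theta_path(A1, P, j, k, p * (K - 1))) < tol))


print(never_responds(A1, P, 0, 1))  # investment after an income innovation
print(never_responds(A1, P, 2, 1))  # consumption after an income innovation
```

In this example only $\theta_{jk,0}, \theta_{jk,1}, \theta_{jk,2}$ are inspected, since $p(K-1) = 2$; responses at longer horizons can be generated with `theta_path` to confirm the conclusion numerically.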

The fact that $\Theta_0$ is lower triangular shows that the ordering of the variables
is of importance, that is, it is important which of the variables is called
$y_1$ and which one is called $y_2$ and so on. One problem with this type of impulse
response analysis is that the ordering of the variables cannot be determined
with statistical methods but has to be specified by the analyst. The ordering
has to be such that the first variable is the only one with a potential
immediate impact on all other variables. The second variable may have an
immediate impact on the last $K - 2$ components of $y_t$ but not on $y_{1t}$ and so
on. To establish such an ordering may be a quite difficult exercise in practice.
The choice of the ordering, the Wold causal ordering, may, to a large extent,
determine the impulse responses and is therefore critical for the interpretation
of the system. Currently we are dealing with known systems only. In this
situation, assuming that the ordering is known may not be a great restriction.
For the investment/income/consumption example it may be reasonable
to assume that an increase in income has an immediate effect on consumption
while increased consumption stimulates the economy and, hence, income with
some time lag.
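
The sensitivity to the ordering can be seen directly in the impact matrix. The sketch below assumes NumPy and uses $\Sigma_u$ of the example; the alternative ordering (investment, consumption, income) is a hypothetical choice made purely for illustration.

```python
import numpy as np

# Sigma_u of the example; variable order: investment, income, consumption.
sigma_u = np.array([[2.25, 0.0, 0.0],
                    [0.0, 1.0, 0.5],
                    [0.0, 0.5, 0.74]])
P = np.linalg.cholesky(sigma_u)

# Hypothetical alternative ordering: investment, consumption, income.
order = [0, 2, 1]
sigma_alt = sigma_u[np.ix_(order, order)]
P_alt = np.linalg.cholesky(sigma_alt)

# Original ordering: income is variable 1, consumption is variable 2.
print("income -> consumption:", P[2, 1], "| consumption -> income:", P[1, 2])

# Alternative ordering: consumption is variable 1, income is variable 2.
print("income -> consumption:", P_alt[1, 2], "| consumption -> income:", P_alt[2, 1])
```

Under the original ordering an income innovation moves consumption on impact while the reverse effect is zero; under the alternative ordering the roles are exchanged, although $\Sigma_u$ and hence the underlying process are unchanged.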

Our interpretation of orthogonalized impulse responses is based on the
representation (2.3.30) and the impulses are viewed as changes in the observed
variables. Sometimes it is more plausible to focus on impulses which cannot
be associated easily with changes in a specific observed variable within the
system. In that case, it may be more logical to base the interpretation on the
MA representation (2.3.27) which decomposes the variables in contributions
of the $w_{kt}$ innovations. If these innovations can be associated with a specific
impulse to the system, the orthogonalized impulse responses reflect the
reactions of the variables to such possibly unobserved innovations. In that case, a
specific impulse or shock to the system can have an instantaneous impact on
several variables while some other impulse may only have an instantaneous
effect on one specific variable and may affect the other variables only with
some delay. By decomposing $\Sigma_u = PP'$ with some non-triangular $P$ matrix,
it is also possible that all shocks have instantaneous effects on all observed
variables of the system. In this kind of interpretation, finding the decomposition
matrix $P$ and, hence, the innovations $w_t$ which actually can be associated
with shocks of interest, is often a difficult part of the analysis. We will provide
a more in-depth discussion of the related problems in Chapter 9 which deals
with structural VAR models.

Critique  of  Impulse  Response  Analysis 

Besides  specifying  the  relevant  impulses  to  a  system,  there  are  a  number  of 
further  problems  that  render  the  interpretation  of  impulse  responses  difficult. 
We  have  mentioned  some  of  them  in  the  context  of  Granger-causality.  A  major 
limitation  of  our  systems  is  their  potential  incompleteness.  Although  in  real 
economic systems almost everything depends on everything else, we will usually
work with low-dimensional VAR systems. All effects of omitted variables
are  assumed  to  be  in  the  innovations.  If  important  variables  are  omitted  from 
the  system,  this  may  lead  to  major  distortions  in  the  impulse  responses  and 
makes  them  worthless  for  structural  interpretations.  The  system  may  still  be 
useful  for  prediction,  though. 

To see the related problems more clearly, consider a system $y_t$ which is
partitioned in vectors $z_t$ and $x_t$ as in (2.3.5). If only the $z_t$ variables are
considered and the $x_t$ variables are omitted from the analysis, we get a system
\[
z_t = \mu_1 + \sum_{i=0}^{\infty} \Phi_{11,i} u_{1,t-i} + \sum_{i=1}^{\infty} \Phi_{12,i} u_{2,t-i}
    = \mu_1 + \sum_{i=0}^{\infty} F_i v_{t-i}
\tag{2.3.32}
\]
as in (2.3.8). The actual reactions of the $z_t$ components to innovations $u_{1t}$ may
be given by the $\Phi_{11,i}$ matrices. On the other hand, the $F_i$ or corresponding
orthogonalized "impulse responses" are likely to be interpreted as impulse
responses if the analyst does not realize that important variables have been
omitted. As we have seen in Section 2.3.1, the $F_i$ will be equal to the $\Phi_{11,i}$ if
and only if $x_t$ does not Granger-cause $z_t$.

Further problems related to the interpretation of the MA coefficients as
dynamic multipliers or impulse responses result from measurement errors and
the use of seasonally adjusted or temporally and/or contemporaneously
aggregated variables. A detailed account of the aggregation problem is given by
Lütkepohl (1987). We will discuss these problems in more detail in Chapter
11 in the context of more general models. These problems severely limit the
interpretability of the MA coefficients of a VAR system as impulse responses.
In the next subsection a further possibility to interpret VAR models will be
considered.


2.3.3  Forecast  Error  Variance  Decomposition 

If the innovations which actually drive the system can be identified, a further
tool for interpreting VAR models is available. Suppose a recursive identification
scheme is available so that the MA representation (2.3.15) with orthogonal
white noise innovations may be considered. In the context of the
representation
\[
y_t = \mu + \sum_{i=0}^{\infty} \Theta_i w_{t-i}
\tag{2.3.33}
\]
with $\Sigma_w = I_K$, the error of the optimal $h$-step forecast is
\[
y_{t+h} - y_t(h) = \sum_{i=0}^{h-1} \Phi_i u_{t+h-i}
                 = \sum_{i=0}^{h-1} \Phi_i P P^{-1} u_{t+h-i}
                 = \sum_{i=0}^{h-1} \Theta_i w_{t+h-i}.
\tag{2.3.34}
\]

Denoting the $mn$-th element of $\Theta_i$ by $\theta_{mn,i}$ as before, the $h$-step forecast error
of the $j$-th component of $y_t$ is
\[
y_{j,t+h} - y_{j,t}(h)
  = \sum_{i=0}^{h-1} (\theta_{j1,i} w_{1,t+h-i} + \cdots + \theta_{jK,i} w_{K,t+h-i})
  = \sum_{k=1}^{K} (\theta_{jk,0} w_{k,t+h} + \cdots + \theta_{jk,h-1} w_{k,t+1}).
\tag{2.3.35}
\]
Thus, the forecast error of the $j$-th component potentially consists of all the
innovations $w_{1t}, \dots, w_{Kt}$. Of course, some of the $\theta_{mn,i}$ may be zero so that
some components may not appear in (2.3.35). Because the $w_{k,t}$'s are
uncorrelated and have unit variances, the MSE of $y_{j,t}(h)$ is
\[
E(y_{j,t+h} - y_{j,t}(h))^2 = \sum_{k=1}^{K} (\theta_{jk,0}^2 + \cdots + \theta_{jk,h-1}^2).
\]


Therefore,
\[
\theta_{jk,0}^2 + \theta_{jk,1}^2 + \cdots + \theta_{jk,h-1}^2 = \sum_{i=0}^{h-1} (e_j' \Theta_i e_k)^2
\tag{2.3.36}
\]
is sometimes interpreted as the contribution of innovations in variable $k$ to
the forecast error variance or MSE of the $h$-step forecast of variable $j$. Here
$e_k$ is the $k$-th column of $I_K$. Dividing (2.3.36) by
\[
\mathrm{MSE}[y_{j,t}(h)] = \sum_{i=0}^{h-1} \sum_{k=1}^{K} \theta_{jk,i}^2
\]
gives
\[
\omega_{jk,h} = \sum_{i=0}^{h-1} (e_j' \Theta_i e_k)^2 / \mathrm{MSE}[y_{j,t}(h)]
\tag{2.3.37}
\]

which is the proportion of the $h$-step forecast error variance of variable $j$,
accounted for by $w_{kt}$ innovations. If $w_{kt}$ can be associated with variable $k$, $\omega_{jk,h}$
represents the proportion of the $h$-step forecast error variance accounted for
by innovations in variable $k$. Thereby, the forecast error variance is decomposed
into components accounted for by innovations in the different variables
of the system. From (2.3.34), the $h$-step forecast MSE matrix is seen to be
\[
\Sigma_y(h) = \mathrm{MSE}[y_t(h)] = \sum_{i=0}^{h-1} \Theta_i \Theta_i' = \sum_{i=0}^{h-1} \Phi_i \Sigma_u \Phi_i'.
\]
The diagonal elements of this matrix are the MSEs of the $y_{jt}$ variables which
may be used in (2.3.37).
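
The decomposition (2.3.37) is straightforward to compute once the $\Theta_i$ are available. The following minimal sketch assumes NumPy and the investment/income/consumption VAR(1) example; `A1` is the coefficient matrix implied by $\Theta_1 = A_1 P$ in (2.3.31), and the function name `fevd` is hypothetical.

```python
import numpy as np

# Example VAR(1): A1 is implied by Theta_1 = A1 P in (2.3.31)
# (an assumption of this sketch); P = Theta_0 from (2.3.31).
A1 = np.array([[0.5, 0.0, 0.0],
               [0.1, 0.1, 0.3],
               [0.0, 0.2, 0.3]])
P = np.array([[1.5, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 0.7]])


def fevd(A1, P, h):
    """Return omega with omega[j, k] = omega_{jk,h} of (2.3.37) for a VAR(1)."""
    K = A1.shape[0]
    contrib = np.zeros((K, K))
    phi = np.eye(K)
    for _ in range(h):                 # i = 0, ..., h-1
        theta = phi @ P                # Theta_i = Phi_i P
        contrib += theta ** 2          # accumulates theta_{jk,i}^2
        phi = phi @ A1
    mse = contrib.sum(axis=1, keepdims=True)   # MSE[y_{j,t}(h)] for each j
    return contrib / mse


for h in (1, 2, 3):
    print(f"h = {h}\n{np.round(fevd(A1, P, h), 3)}")
```

Each row of the returned matrix sums to one; row `j` gives the shares of the $h$-step forecast error variance of variable $j$ attributed to investment, income, and consumption innovations, in that order, and can be compared with the entries reported for this example in Table 2.1.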

For  the  investment/income/consumption  example,  forecast  error  variance 
decompositions  of  all  three  variables  are  given  in  Table  2.1.  For  instance, 
about  66%  of  the  1-step  forecast  error  variance  of  consumption  is  accounted 
for  by  own  innovations  and  about  34%  is  accounted  for  by  income  innovations. 
For  long  term  forecasts,  57.5%  and  42.3%  of  the  error  variance  is  accounted 
for  by  consumption  and  income  innovations,  respectively.  For  any  forecast 
horizon,  investment  innovations  contribute  less  than  1%  to  the  forecast  error 
variance  of  consumption.  Moreover,  only  small  fractions  (less  than  10%)  of 
the  forecast  error  variances  of  income  are  accounted  for  by  innovations  in 
the  other  variables  of  the  system.  This  kind  of  analysis  is  sometimes  called 
innovation  accounting. 

Table 2.1. Forecast error variance decomposition of the investment/income/consumption
system

                             proportions of forecast error variance h periods
forecast       forecast      ahead accounted for by innovations in
error in       horizon h     investment     income     consumption
--------------------------------------------------------------------
investment         1             1            0            0
                   2             1            0            0
                   3             1            0            0
                   4             1            0            0
                   5             1            0            0
                  10             1            0            0
                   ∞             1            0            0
income             1             0            1            0
                   2            .020         .941         .039
                   3            .026         .930         .044
                   4            .029         .926         .045
                   5            .030         .925         .045
                  10            .030         .925         .045
                   ∞            .030         .925         .045
consumption        1             0           .338         .662
                   2             0           .411         .589
                   3            .001         .421         .578
                   4            .002         .423         .576
                   5            .002         .423         .575
                  10            .002         .423         .575
                   ∞            .002         .423         .575

From Proposition 2.5, it is obvious that for a stationary, stable, $K$-dimensional
VAR($p$) process $y_t$ all forecast error variance proportions of variable
$j$, accounted for by innovations in variable $k$, will be zero if $\omega_{jk,h} = 0$ for
$h = pK - p + 1$. In this context it is perhaps worth pointing out the relationship
between Granger-causality and forecast error variance components. For that
purpose we consider a bivariate system $y_t = (z_t, x_t)'$ first. In such a system, if
$z_t$ does not Granger-cause $x_t$, the proportions of forecast error variances of $x_t$
accounted for by innovations in $z_t$ may still be nonzero. This property follows
directly from the definition of the $\Theta_i$ in (2.3.15). Granger-noncausality, by
Proposition 2.2, implies zero constraints on the $\Phi_i$ which may disappear in
the $\Theta_i$ if the error covariance matrix $\Sigma_u$ is not diagonal. On the other hand, if
$\Sigma_u$ is diagonal, so that there is no instantaneous causation between $z_t$ and $x_t$
and if, in addition, $z_t$ is not Granger-causal for $x_t$ the lower left-hand elements
of the $\Theta_i$ will be zero (see (2.3.19)). Therefore, the proportion of forecast error
variance of $x_t$ accounted for by $z_t$ innovations will also be zero.

In a higher dimensional system, suppose a set of variables $z_t$ does not
Granger-cause the remaining variables $x_t$ and there is also no instantaneous
causality between the two sets of variables. In that case, the forecast MSE
proportions of all $x_t$ variables accounted for by $z_t$ variables will be zero.
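
The interplay between Granger-noncausality and instantaneous correlation can be illustrated with a small bivariate sketch. It assumes NumPy; the coefficient matrix, the two covariance matrices, and the function name `fevd_share` are hypothetical choices for illustration, with a lower triangular Choleski factor used as $P$.

```python
import numpy as np


def fevd_share(A1, sigma_u, j, k, h):
    """Share of the h-step forecast error variance of variable j due to w_k (VAR(1))."""
    P = np.linalg.cholesky(sigma_u)
    contrib = np.zeros_like(A1)
    phi = np.eye(A1.shape[0])
    for _ in range(h):
        contrib += (phi @ P) ** 2      # theta_{jk,i}^2 for i = 0, ..., h-1
        phi = phi @ A1
    return contrib[j, k] / contrib[j].sum()


# y_t = (z_t, x_t)'. The zero in the lower left corner of A1 means that
# z does not Granger-cause x (hypothetical coefficients).
A1 = np.array([[0.5, 0.3],
               [0.0, 0.4]])
sigma_diag = np.eye(2)                     # no instantaneous causality
sigma_corr = np.array([[1.0, 0.5],
                       [0.5, 1.0]])        # instantaneous correlation

share_diag = fevd_share(A1, sigma_diag, j=1, k=0, h=4)
share_corr = fevd_share(A1, sigma_corr, j=1, k=0, h=4)
print(share_diag, share_corr)
```

With the diagonal covariance matrix the share of the forecast error variance of $x_t$ attributed to $z_t$ innovations is zero, whereas with correlated residuals it is positive even though the zero restriction on the VAR coefficients is the same in both cases.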

It is important to understand, however, that Granger-causality and forecast
error variance decompositions are quite different concepts because
Granger-causality and instantaneous causality are different concepts. While
Granger-causality is a uniquely defined property of two subsets of variables of a
given process, the forecast error variance decomposition is not unique as it
depends on the $\Theta_i$ matrices and, thus, on the choice of the transformation matrix
$P$. Therefore, the interpretation of a forecast error variance decomposition is
subject to similar criticisms as the interpretation of impulse responses. In
addition, all the critical points raised in the context of Granger-causality apply.
That is, the forecast error variance components are conditional on the system
under consideration. They may change if the system is expanded by adding
further variables or if variables are deleted from the system. Also measurement
errors, seasonal adjustment and the use of aggregates may contaminate
the forecast error variance decompositions.

2.3.4  Remarks  on  the  Interpretation  of  VAR  Models 

Innovation  accounting  and  impulse  response  analysis  in  the  framework  of  VAR 
models  have  been  pioneered  by  Sims  (1980,  1981)  and  others  as  an  alternative 
to  classical  macroeconomic  analyses.  Sims’  main  criticism  of  the  latter  type 
of  analysis  is  that  macroeconometric  models  are  often  not  based  on  sound 
economic  theories  or  the  available  theories  are  not  capable  of  providing  a 
completely  specified  model.  If  economic  theories  are  not  available  to  specify 
the  model,  statistical  tools  must  be  applied.  In  this  approach,  a  fairly  loose 
model  is  set  up  which  does  not  impose  rigid  a  priori  restrictions  on  the  data 
generation process. Statistical tools are then used to determine possible
constraints. VAR models represent a class of loose models that may be used in
such an approach. Of course, in order to interpret these models, some restrictive
assumptions need to be made. In particular, the ordering of the variables
may be essential for interpretations of the types discussed in the previous
subsections. Sims (1981) suggests trying different orderings and investigating
the sensitivity of the corresponding orthogonalized impulse responses and the
related conclusions to the ordering of the variables.

So  far  we  have  assumed  that  a  VAR  model  is  given  to  us.  Under  this 
assumption  we  have  discussed  forecasting  and  interpretation  of  the  system. 
In  this  situation  it  is  of  course  unnecessary  to  use  statistical  tools  in  order 
to  determine  constraints  for  the  system  because  all  constraints  are  known.  In 
practice,  we  will  virtually  never  be  in  such  a  fortunate  situation  but  we  have 
to  determine  the  model  from  a  given  set  of  time  series  data.  This  problem  will 
be  treated  in  subsequent  chapters.  The  purpose  of  this  chapter  is  to  identify 
some  problems  that  are  not  related  to  estimation  and  model  specification  but 
are  inherent  to  the  types  of  models  considered. 


2.4  Exercises 

Problem 2.1

Show that
\[
\det(I_{Kp} - \mathbf{A} z) = \det(I_K - A_1 z - \cdots - A_p z^p)
\]
where $A_i$, $i = 1, \dots, p$, and $\mathbf{A}$ are as in (2.1.1) and (2.1.8), respectively.

Problem 2.2

Prove Proposition 2.5.

(Hint: $\Theta(L) = \Phi(L)P = A(L)^{adj} P / \det A(L)$.)

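Problem 2.3

In the United States of Wonderland the growth rates of income (GNP) and
the money stock (M2) as well as an interest rate (IR) are related as in the
following VAR(2) model:
\[
\begin{bmatrix} \mathrm{GNP}_t \\ \mathrm{M2}_t \\ \mathrm{IR}_t \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}
+ \begin{bmatrix} .7 & .1 & 0 \\ 0 & .4 & .1 \\ .9 & 0 & .8 \end{bmatrix}
  \begin{bmatrix} \mathrm{GNP}_{t-1} \\ \mathrm{M2}_{t-1} \\ \mathrm{IR}_{t-1} \end{bmatrix}
+ \begin{bmatrix} -.2 & 0 & 0 \\ 0 & .1 & .1 \\ 0 & 0 & 0 \end{bmatrix}
  \begin{bmatrix} \mathrm{GNP}_{t-2} \\ \mathrm{M2}_{t-2} \\ \mathrm{IR}_{t-2} \end{bmatrix}
+ \begin{bmatrix} u_{1t} \\ u_{2t} \\ u_{3t} \end{bmatrix},
\tag{2.4.1}
\]
\[
\Sigma_u = \begin{bmatrix} .26 & .03 & 0 \\ .03 & .09 & 0 \\ 0 & 0 & .81 \end{bmatrix} = PP', \qquad
P = \begin{bmatrix} .5 & .1 & 0 \\ 0 & .3 & 0 \\ 0 & 0 & .9 \end{bmatrix}.
\]

(a) Show that the process $y_t = (\mathrm{GNP}_t, \mathrm{M2}_t, \mathrm{IR}_t)'$ is stable.

(b) Determine the mean vector of $y_t$.

(c) Write the process $y_t$ in VAR(1) form.

(d) Compute the coefficient matrices $\Phi_1, \dots, \Phi_5$ of the MA representation
(2.1.17) of $y_t$.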

Problem 2.4

Determine the autocovariances $\Gamma_y(0)$, $\Gamma_y(1)$, $\Gamma_y(2)$, $\Gamma_y(3)$ of the process
defined in (2.4.1). Compute and plot the autocorrelations $R_y(0)$, $R_y(1)$, $R_y(2)$,
$R_y(3)$.

Problem 2.5

Consider again the process (2.4.1).

(a) Suppose that
\[
y_{2000} = \begin{bmatrix} .7 \\ 1.0 \\ 1.5 \end{bmatrix}
\quad \text{and} \quad
y_{1999} = \begin{bmatrix} 1.0 \\ 1.5 \\ 3.0 \end{bmatrix}
\]
and forecast $y_{2001}$, $y_{2002}$, and $y_{2003}$.

(b) Determine the MSE matrices for forecast horizons $h = 1, 2, 3$.

(c) Assume that $y_t$ is a Gaussian process and construct 90% and 95% forecast
intervals for $t = 2001, 2002, 2003$.

(d) Use the Bonferroni method to determine a joint forecast region for
$\mathrm{GNP}_{2001}$, $\mathrm{GNP}_{2002}$, $\mathrm{GNP}_{2003}$ with probability content at least 97%.

Problem 2.6

Answer  the  following  questions  for  the  process  (2.4.1). 

(a)  Is  M2  Granger-causal  for  (GNP,  IR)? 

(b)  Is  IR  Granger-causal  for  (GNP,  M2)? 

(c)  Is  there  instantaneous  causality  between  M2  and  (GNP,  IR)? 

(d)  Is  there  instantaneous  causality  between  IR  and  (GNP,  M2)? 

(e)  Is  IR  2-step  causal  for  GNP? 

Problem 2.7

Plot  the  effect  of  a  unit  innovation  in  the  interest  rate  (IR)  on  the  three 
variables  of  the  system  (2.4.1)  in  terms  of  the  MA  representation  (2.1.17). 
Consider  only  5  periods  following  the  innovation.  Plot  also  the  accumulated 
responses  and  interpret  the  plots. 

Problem  2.8 

For the system (2.4.1), derive the coefficient matrices $\Theta_0, \dots, \Theta_5$ of the MA
representation (2.3.15) using the upper triangular $P$ matrix given in (2.4.1).
Plot the effects of a unit innovation in IR in terms of that representation.
Compare to the plots obtained in Problem 2.7 and interpret. Repeat the
analysis with a lower triangular $P$ matrix and comment on the results.

Problem  2.9 

Decompose the MSE of the forecast $\mathrm{GNP}_t(5)$ into the proportions accounted
for by its own innovations and innovations in M2 and IR.


3.1 Introduction

In this chapter, it is assumed that a $K$-dimensional multiple time series
$y_1, \dots, y_T$ with $y_t = (y_{1t}, \dots, y_{Kt})'$ is available that is known to be
generated by a stationary, stable VAR($p$) process
\[
y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t.
\tag{3.1.1}
\]
All symbols have their usual meanings, that is, $\nu = (\nu_1, \dots, \nu_K)'$ is a ($K \times 1$)
vector of intercept terms, the $A_i$ are ($K \times K$) coefficient matrices and
$u_t$ is white noise with nonsingular covariance matrix $\Sigma_u$. In contrast to the
assumptions of the previous chapter, the coefficients $\nu, A_1, \dots, A_p$, and $\Sigma_u$ are
assumed to be unknown in the following. The time series data will be used to
estimate the coefficients. Note that notationwise we do not distinguish between
the stochastic process and a time series as a realization of a stochastic process.
The particular meaning of a symbol should be obvious from the context.

In  the  next  three  sections,  different  possibilities  for  estimating  a  VAR(p) 
process  are  discussed.  In  Section  3.5,  the  consequences  of  forecasting  with 
estimated  processes  will  be  considered  and,  in  Section  3.6,  tests  for  causality 
are  described.  The  distribution  of  impulse  responses  obtained  from  estimated 
processes  is  considered  in  Section  3.7. 


3.2  Multivariate  Least  Squares  Estimation 

In this section, multivariate least squares (LS) estimation is discussed. The
estimator obtained for the standard form (3.1.1) of a VAR($p$) process is
considered in Section 3.2.1. Some properties of the estimator are derived in Sections
3.2.2 and 3.2.4 and an example is given in Section 3.2.3.

3.2.1 The Estimator

It is assumed that a time series $y_1, \dots, y_T$ of the $y$ variables is available, that
is, we have a sample of size $T$ for each of the $K$ variables for the same sample
period. In addition, $p$ presample values for each variable, $y_{-p+1}, \dots, y_0$, are
assumed to be available. Partitioning a multiple time series in sample and
presample values is convenient in order to simplify the notation. We define
\[
\begin{array}{lll}
Y := (y_1, \dots, y_T) & & (K \times T), \\
B := (\nu, A_1, \dots, A_p) & & (K \times (Kp + 1)), \\
Z_t := \begin{bmatrix} 1 \\ y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix} & & ((Kp + 1) \times 1), \\
Z := (Z_0, \dots, Z_{T-1}) & & ((Kp + 1) \times T), \\
U := (u_1, \dots, u_T) & & (K \times T), \\
\mathbf{y} := \operatorname{vec}(Y) & & (KT \times 1), \\
\boldsymbol{\beta} := \operatorname{vec}(B) & & ((K^2 p + K) \times 1), \\
\mathbf{b} := \operatorname{vec}(B') & & ((K^2 p + K) \times 1), \\
\mathbf{u} := \operatorname{vec}(U) & & (KT \times 1).
\end{array}
\tag{3.2.1}
\]
Here vec is the column stacking operator as defined in Appendix A.12.

Using this notation, for $t = 1, \dots, T$, the VAR($p$) model (3.1.1) can be
written compactly as
\[
Y = BZ + U
\tag{3.2.2}
\]


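or
\[
\operatorname{vec}(Y) = \operatorname{vec}(BZ) + \operatorname{vec}(U)
                      = (Z' \otimes I_K) \operatorname{vec}(B) + \operatorname{vec}(U)
\]
or
\[
\mathbf{y} = (Z' \otimes I_K) \boldsymbol{\beta} + \mathbf{u}.
\tag{3.2.3}
\]
Note that the covariance matrix of $\mathbf{u}$ is
\[
\Sigma_{\mathbf{u}} = I_T \otimes \Sigma_u.
\tag{3.2.4}
\]
Thus, multivariate LS estimation (or GLS estimation) of $\boldsymbol{\beta}$ means to choose
the estimator that minimizes
\[
\begin{aligned}
S(\boldsymbol{\beta}) &= \mathbf{u}' (I_T \otimes \Sigma_u)^{-1} \mathbf{u} = \mathbf{u}' (I_T \otimes \Sigma_u^{-1}) \mathbf{u} \\
 &= [\mathbf{y} - (Z' \otimes I_K) \boldsymbol{\beta}]' (I_T \otimes \Sigma_u^{-1}) [\mathbf{y} - (Z' \otimes I_K) \boldsymbol{\beta}] \\
 &= \operatorname{vec}(Y - BZ)' (I_T \otimes \Sigma_u^{-1}) \operatorname{vec}(Y - BZ) \\
 &= \operatorname{tr}[(Y - BZ)' \Sigma_u^{-1} (Y - BZ)].
\end{aligned}
\tag{3.2.5}
\]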

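In order to find the minimum of this function we note that
\[
\begin{aligned}
S(\boldsymbol{\beta}) &= \mathbf{y}' (I_T \otimes \Sigma_u^{-1}) \mathbf{y}
   + \boldsymbol{\beta}' (Z \otimes I_K)(I_T \otimes \Sigma_u^{-1})(Z' \otimes I_K) \boldsymbol{\beta}
   - 2 \boldsymbol{\beta}' (Z \otimes I_K)(I_T \otimes \Sigma_u^{-1}) \mathbf{y} \\
 &= \mathbf{y}' (I_T \otimes \Sigma_u^{-1}) \mathbf{y}
   + \boldsymbol{\beta}' (ZZ' \otimes \Sigma_u^{-1}) \boldsymbol{\beta}
   - 2 \boldsymbol{\beta}' (Z \otimes \Sigma_u^{-1}) \mathbf{y}.
\end{aligned}
\]
Hence,
\[
\frac{\partial S(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
  = 2 (ZZ' \otimes \Sigma_u^{-1}) \boldsymbol{\beta} - 2 (Z \otimes \Sigma_u^{-1}) \mathbf{y}.
\]
Equating to zero gives the normal equations
\[
(ZZ' \otimes \Sigma_u^{-1}) \hat{\boldsymbol{\beta}} = (Z \otimes \Sigma_u^{-1}) \mathbf{y}
\tag{3.2.6}
\]
and, consequently, the LS estimator is
\[
\hat{\boldsymbol{\beta}} = ((ZZ')^{-1} \otimes \Sigma_u)(Z \otimes \Sigma_u^{-1}) \mathbf{y}
                         = ((ZZ')^{-1} Z \otimes I_K) \mathbf{y}.
\tag{3.2.7}
\]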

The Hessian of $S(\boldsymbol{\beta})$,
\[
\frac{\partial^2 S}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}'} = 2 (ZZ' \otimes \Sigma_u^{-1}),
\]
is positive definite which confirms that $\hat{\boldsymbol{\beta}}$ is indeed a minimizing vector.
Strictly speaking, for these results to hold, it has to be assumed that $ZZ'$
is nonsingular. This result will hold with probability 1 if $y_t$ has a continuous
distribution which will always be assumed in the following.

It may be worth noting that the multivariate LS estimator $\hat{\boldsymbol{\beta}}$ is identical
to the ordinary LS (OLS) estimator obtained by minimizing
\[
\bar{S}(\boldsymbol{\beta}) = \mathbf{u}' \mathbf{u}
  = [\mathbf{y} - (Z' \otimes I_K) \boldsymbol{\beta}]' [\mathbf{y} - (Z' \otimes I_K) \boldsymbol{\beta}]
\tag{3.2.8}
\]
(see Problem 3.1). This result is due to Zellner (1962) who showed that GLS
and LS estimation in a multiple equation model are identical if the regressors
in all equations are the same.
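
The equivalence of GLS and OLS with identical regressors can be checked numerically. The following is a minimal sketch, assuming NumPy; the data are synthetic, `n_reg` plays the role of $Kp + 1$, and the weighting matrix `sigma_u` is an arbitrary positive definite matrix chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_reg, T = 2, 3, 30                       # n_reg plays the role of Kp + 1

# Synthetic regressors (first row is the constant) shared by all equations.
Z = np.vstack([np.ones(T), rng.standard_normal((n_reg - 1, T))])
B = rng.standard_normal((K, n_reg))          # hypothetical true coefficients
Y = B @ Z + rng.standard_normal((K, T))

sigma_u = np.array([[1.0, 0.6],
                    [0.6, 2.0]])             # any positive definite matrix

y = Y.reshape(-1, order="F")                 # vec(Y), column stacking
X = np.kron(Z.T, np.eye(K))                  # (Z' kron I_K)
W = np.kron(np.eye(T), np.linalg.inv(sigma_u))   # (I_T kron Sigma_u^{-1})

beta_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # minimizes (3.2.5)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)           # minimizes (3.2.8)

print(np.max(np.abs(beta_gls - beta_ols)))
```

The two estimates agree up to rounding error for any admissible choice of `sigma_u`, which is why $\Sigma_u$ drops out of (3.2.7).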

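The LS estimator can be written in different ways that will be useful later
on:
\[
\hat{\boldsymbol{\beta}} = ((ZZ')^{-1} Z \otimes I_K) [(Z' \otimes I_K) \boldsymbol{\beta} + \mathbf{u}]
                         = \boldsymbol{\beta} + ((ZZ')^{-1} Z \otimes I_K) \mathbf{u}
\tag{3.2.9}
\]
or
\[
\operatorname{vec}(\hat{B}) = \hat{\boldsymbol{\beta}} = ((ZZ')^{-1} Z \otimes I_K) \operatorname{vec}(Y)
                            = \operatorname{vec}(Y Z' (ZZ')^{-1}).
\]
Thus,
\[
\hat{B} = Y Z' (ZZ')^{-1} = (BZ + U) Z' (ZZ')^{-1} = B + U Z' (ZZ')^{-1}.
\tag{3.2.10}
\]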
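
The following minimal sketch, assuming NumPy, simulates a bivariate VAR(1) with hypothetical coefficients, builds $Y$ and $Z$ as in (3.2.1), and checks the representations (3.2.7) and (3.2.10) against each other. The presample value $y_0$ is set to zero for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 2, 50
nu = np.array([0.5, 1.0])                    # hypothetical intercepts
A1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])                  # hypothetical stable VAR(1) coefficients
B = np.column_stack([nu, A1])                # B = (nu, A_1), K x (Kp + 1)

# Simulate y_1, ..., y_T given the presample value y_0 = 0.
U = rng.standard_normal((K, T))              # innovations u_1, ..., u_T as columns
data = np.zeros((K, T + 1))
for t in range(1, T + 1):
    data[:, t] = nu + A1 @ data[:, t - 1] + U[:, t - 1]

Y = data[:, 1:]                              # (K x T)
Z = np.vstack([np.ones(T), data[:, :-1]])    # columns Z_0, ..., Z_{T-1}

# B_hat = Y Z'(ZZ')^{-1}, computed via a linear solve.
B_hat = np.linalg.solve(Z @ Z.T, Z @ Y.T).T

# beta_hat = ((ZZ')^{-1} Z kron I_K) vec(Y), as in (3.2.7).
beta_hat = np.kron(np.linalg.inv(Z @ Z.T) @ Z, np.eye(K)) @ Y.reshape(-1, order="F")

print(np.round(B_hat, 3))
```

In this simulated setting the true $B$ and $U$ are known, so the estimation error $\hat{B} - B$ can be compared directly with $UZ'(ZZ')^{-1}$ from (3.2.10).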

Another possibility for deriving this estimator results from postmultiplying
\[
y_t = B Z_{t-1} + u_t
\]
by $Z_{t-1}'$ and taking expectations:
\[
E(y_t Z_{t-1}') = B \, E(Z_{t-1} Z_{t-1}').
\tag{3.2.11}
\]
Estimating $E(y_t Z_{t-1}')$ by
\[
\frac{1}{T} \sum_{t=1}^{T} y_t Z_{t-1}' = \frac{1}{T} Y Z'
\]
and $E(Z_{t-1} Z_{t-1}')$ by
\[
\frac{1}{T} \sum_{t=1}^{T} Z_{t-1} Z_{t-1}' = \frac{1}{T} Z Z',
\]
we obtain the normal equations
\[
\frac{1}{T} Y Z' = \hat{B} \, \frac{1}{T} Z Z'
\]
and, hence, $\hat{B} = Y Z' (ZZ')^{-1}$. Note that (3.2.11) is similar but not identical
to the system of Yule-Walker equations in (2.1.37). While central moments
about the expectation $\mu = E(y_t)$ are considered in (2.1.37), moments about
zero are used in (3.2.11).

Yet another possibility to write the LS estimator is
\[
\hat{\mathbf{b}} = \operatorname{vec}(\hat{B}') = (I_K \otimes (ZZ')^{-1} Z) \operatorname{vec}(Y').
\tag{3.2.12}
\]
In this form, it is particularly easy to see that multivariate LS estimation is
equivalent to OLS estimation of each of the $K$ equations in (3.1.1) separately.
Let $b_k'$ be the $k$-th row of $B$, that is, $b_k$ contains all the parameters of the $k$-th
equation. Obviously $\mathbf{b}' = (b_1', \dots, b_K')$. Furthermore, let $y_{(k)} = (y_{k1}, \dots, y_{kT})'$
be the time series available for the $k$-th variable, so that
\[
\operatorname{vec}(Y') = \begin{bmatrix} y_{(1)} \\ \vdots \\ y_{(K)} \end{bmatrix}.
\]
With this notation $\hat{b}_k = (ZZ')^{-1} Z y_{(k)}$ is the OLS estimator of the model
$y_{(k)} = Z' b_k + u_{(k)}$, where $u_{(k)} = (u_{k1}, \dots, u_{kT})'$ and $\hat{\mathbf{b}}' = (\hat{b}_1', \dots, \hat{b}_K')$.
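
The equation-by-equation interpretation can be checked with a few lines. The sketch below assumes NumPy and synthetic data; `n_reg` stands in for $Kp + 1$, and `np.linalg.lstsq` is used for the single-equation regressions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n_reg, T = 3, 4, 60                       # n_reg plays the role of Kp + 1
Z = np.vstack([np.ones(T), rng.standard_normal((n_reg - 1, T))])
Y = rng.standard_normal((K, n_reg)) @ Z + rng.standard_normal((K, T))

# Multivariate LS estimator of all equations at once: Y Z'(ZZ')^{-1}.
B_hat = np.linalg.solve(Z @ Z.T, Z @ Y.T).T

# OLS equation by equation: regress y_(k) on Z' for each k.
rows = [np.linalg.lstsq(Z.T, Y[k], rcond=None)[0] for k in range(K)]
B_by_equation = np.vstack(rows)

print(np.max(np.abs(B_hat - B_by_equation)))
```

Row $k$ of `B_by_equation` is the estimate $\hat{b}_k'$, so stacking the single-equation results reproduces $\hat{B}$ up to rounding error.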

3.2.2 Asymptotic Properties of the Least Squares Estimator

Because  small  sample  properties  of  the  LS  estimator  are  difficult  to  derive 
analytically,  we  focus  on  asymptotic  properties.  Consistency  and  asymptotic 
normality  of  the  LS  estimator  are  easily  established  if  the  following  results 
hold: 

\[
\Gamma := \operatorname{plim} ZZ'/T \ \text{ exists and is nonsingular}
\tag{3.2.13}
\]
and
\[
\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \operatorname{vec}(u_t Z_{t-1}')
  = \frac{1}{\sqrt{T}} \operatorname{vec}(U Z')
  = \frac{1}{\sqrt{T}} (Z \otimes I_K) \mathbf{u}
  \xrightarrow[T \to \infty]{d} \mathcal{N}(0, \Gamma \otimes \Sigma_u),
\tag{3.2.14}
\]
where, as usual, $\xrightarrow{d}$ denotes convergence in distribution. It follows from a
theorem due to Mann & Wald (1943) that these results are true under suitable
conditions for $u_t$, if $y_t$ is a stationary, stable VAR($p$). For instance, the
conditions stated in the following definition are sufficient.

Definition 3.1 (Standard White Noise)

A white noise process $u_t = (u_{1t}, \dots, u_{Kt})'$ is called standard white noise if
the $u_t$ are continuous random vectors satisfying $E(u_t) = 0$, $\Sigma_u = E(u_t u_t')$ is
nonsingular, $u_t$ and $u_s$ are independent for $s \neq t$, and, for some finite constant
$c$,
\[
E|u_{it} u_{jt} u_{kt} u_{mt}| \leq c \quad \text{for } i, j, k, m = 1, \dots, K, \text{ and all } t.
\]

The last condition means that all fourth moments exist and are bounded.
Obviously, if the $u_t$ are normally distributed (Gaussian) they satisfy the
moment requirements. With this definition it is easy to state conditions for
consistency and asymptotic normality of the LS estimator. The following lemma
will be essential in proving these large sample results.

Lemma 3.1

If $y_t$ is a stable, $K$-dimensional VAR($p$) process as in (3.1.1) with standard
white noise residuals $u_t$, then (3.2.13) and (3.2.14) hold. ■

Proof: See Theorem 8.2.3 of Fuller (1976, p. 340). ■

The lemma holds also for other definitions of standard white noise. For
example, the convergence result in (3.2.14) follows from a central limit theorem
for martingale differences or martingale difference arrays (see Proposition
C.13) by noting that $w_t = \operatorname{vec}(u_t Z_{t-1}')$ is a martingale difference sequence
under quite general conditions. The convergence result in (3.2.13) may then be
obtained from a suitable weak law of large numbers (see Proposition C.12). In
the next proposition the resulting asymptotic properties of the LS estimator
are stated formally.

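Proposition 3.1 (Asymptotic Properties of the LS Estimator)

Let $y_t$ be a stable, $K$-dimensional VAR($p$) process as in (3.1.1) with standard
white noise residuals, $\hat{B} = Y Z' (ZZ')^{-1}$ is the LS estimator of the VAR
coefficients $B$ and all symbols are as defined in (3.2.1). Then,
\[
\operatorname{plim} \hat{B} = B
\]
and
\[
\sqrt{T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \sqrt{T} \operatorname{vec}(\hat{B} - B)
  \xrightarrow{d} \mathcal{N}(0, \Gamma^{-1} \otimes \Sigma_u)
\tag{3.2.15}
\]
or, equivalently,
\[
\sqrt{T} (\hat{\mathbf{b}} - \mathbf{b}) = \sqrt{T} \operatorname{vec}(\hat{B}' - B')
  \xrightarrow{d} \mathcal{N}(0, \Sigma_u \otimes \Gamma^{-1}),
\tag{3.2.16}
\]
where $\Gamma = \operatorname{plim} ZZ'/T$. ■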

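Proof: Using (3.2.10),
\[
\operatorname{plim}(\hat{B} - B) = \operatorname{plim}\left[ \frac{UZ'}{T} \left( \frac{ZZ'}{T} \right)^{-1} \right] = 0
\]
by Lemma 3.1, because (3.2.14) implies $\operatorname{plim} UZ'/T = 0$. Thus, the
consistency of $\hat{B}$ is established.

Using (3.2.9),
\[
\sqrt{T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})
  = \sqrt{T} ((ZZ')^{-1} Z \otimes I_K) \mathbf{u}
  = \left( \left( \frac{1}{T} ZZ' \right)^{-1} \otimes I_K \right) \frac{1}{\sqrt{T}} (Z \otimes I_K) \mathbf{u}.
\]
Thus, by Proposition C.2(4) of Appendix C, $\sqrt{T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ has the same
asymptotic distribution as
\[
\left[ \operatorname{plim} \left( \frac{1}{T} ZZ' \right)^{-1} \otimes I_K \right] \frac{1}{\sqrt{T}} (Z \otimes I_K) \mathbf{u}
  = (\Gamma^{-1} \otimes I_K) \frac{1}{\sqrt{T}} (Z \otimes I_K) \mathbf{u}.
\]
Hence, the asymptotic distribution of $\sqrt{T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is normal by Lemma 3.1
and the covariance matrix is
\[
(\Gamma^{-1} \otimes I_K)(\Gamma \otimes \Sigma_u)(\Gamma^{-1} \otimes I_K) = \Gamma^{-1} \otimes \Sigma_u.
\]
The result (3.2.16) can be established with similar arguments (see Problem
3.2). ■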

As mentioned previously, if $u_t$ is Gaussian (normally distributed) white
noise, it satisfies the conditions of Proposition 3.1 so that consistency and
asymptotic normality of the LS estimator are ensured for stable Gaussian
(normally distributed) VAR($p$) processes $y_t$. Note that normality of $u_t$ implies
normality of the $y_t$ for stable processes.

In order to assess the asymptotic dispersion of the LS estimator, we need to know the matrices $\Gamma$ and $\Sigma_u$. From (3.2.13) an obvious consistent estimator of $\Gamma$ is

$$\hat{\Gamma} = ZZ'/T. \tag{3.2.17}$$

Because $\Sigma_u = E(u_t u_t')$, a plausible estimator for this matrix is

$$\begin{aligned}
\tilde{\Sigma}_u &= \frac{1}{T}\sum_{t=1}^{T}\hat{u}_t\hat{u}_t' = \frac{1}{T}\hat{U}\hat{U}' = \frac{1}{T}(Y - \hat{B}Z)(Y - \hat{B}Z)' \\
&= \frac{1}{T}[Y - YZ'(ZZ')^{-1}Z][Y - YZ'(ZZ')^{-1}Z]' \\
&= \frac{1}{T}Y[I_T - Z'(ZZ')^{-1}Z][I_T - Z'(ZZ')^{-1}Z]'Y' \\
&= \frac{1}{T}Y(I_T - Z'(ZZ')^{-1}Z)Y'.
\end{aligned} \tag{3.2.18}$$

Often an adjustment for degrees of freedom is desired because in a regression with fixed, nonstochastic regressors this leads to an unbiased estimator of the covariance matrix. Thus, an estimator

$$\hat{\Sigma}_u = \frac{T}{T - Kp - 1}\tilde{\Sigma}_u \tag{3.2.19}$$

may be considered. Note that there are $Kp + 1$ parameters in each of the $K$ equations of (3.1.1) and, hence, there are $Kp + 1$ parameters in each equation of the system (3.2.2). Of course, $\hat{\Sigma}_u$ and $\tilde{\Sigma}_u$ are asymptotically equivalent. They are consistent estimators of $\Sigma_u$ if the conditions of Proposition 3.1 hold. In fact, a bit more can be shown.
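The estimators in (3.2.17) to (3.2.19) translate almost line by line into matrix code. The following Python sketch is an illustration only: the function name `var_ls` and the data layout (rows are time periods, the first p rows are presample values) are assumptions made here and are not taken from the text.

```python
import numpy as np

def var_ls(y, p):
    """Multivariate LS for a VAR(p) with intercept (illustrative sketch).

    y has shape (T + p, K); its first p rows are the presample values.
    Returns B_hat (K x (Kp+1)), Sigma_tilde (divisor T),
    Sigma_hat (divisor T - Kp - 1) and the regressor matrix Z.
    """
    n, K = y.shape
    T = n - p
    Y = y[p:].T                                   # (K x T)
    # Row block i of Z holds the i-th lag of every sample observation.
    lags = [y[p - i:n - i].T for i in range(1, p + 1)]
    Z = np.vstack([np.ones((1, T))] + lags)       # ((Kp+1) x T)
    B_hat = Y @ Z.T @ np.linalg.inv(Z @ Z.T)      # B_hat = Y Z'(Z Z')^{-1}
    U_hat = Y - B_hat @ Z                         # LS residuals
    Sigma_tilde = U_hat @ U_hat.T / T             # estimator (3.2.18)
    Sigma_hat = Sigma_tilde * T / (T - K * p - 1) # d.f. adjusted (3.2.19)
    return B_hat, Sigma_tilde, Sigma_hat, Z
```

The two covariance estimators differ only by the scalar factor T/(T - Kp - 1), which is why they are asymptotically equivalent.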


Proposition 3.2 (Asymptotic Properties of the White Noise Covariance Matrix Estimators)

Let $y_t$ be a stable, $K$-dimensional VAR(p) process as in (3.1.1) with standard white noise innovations and let $\bar{B}$ be an estimator of the VAR coefficients $B$ so that $\sqrt{T}\operatorname{vec}(\bar{B} - B)$ converges in distribution. Furthermore, using the symbols from (3.2.1), suppose that

$$\bar{\Sigma}_u = (Y - \bar{B}Z)(Y - \bar{B}Z)'/(T - c),$$

where $c$ is a fixed constant. Then

$$\operatorname{plim}\sqrt{T}(\bar{\Sigma}_u - UU'/T) = 0. \tag{3.2.20}$$

Proof:

$$\frac{1}{T}(Y - \bar{B}Z)(Y - \bar{B}Z)' = (B - \bar{B})\frac{ZZ'}{T}(B - \bar{B})' + (B - \bar{B})\frac{ZU'}{T} + \frac{UZ'}{T}(B - \bar{B})' + \frac{UU'}{T}.$$

Under the conditions of the proposition, $\operatorname{plim}(\bar{B} - B) = 0$. Hence, by Lemma 3.1,

$$\operatorname{plim}\,(B - \bar{B})ZU'/\sqrt{T} = 0$$

and

$$\operatorname{plim}\left[(B - \bar{B})\frac{ZZ'}{T}\sqrt{T}(B - \bar{B})'\right] = 0$$

(see Appendix C.1). Thus,

$$\operatorname{plim}\sqrt{T}\left[(Y - \bar{B}Z)(Y - \bar{B}Z)'/T - UU'/T\right] = 0.$$

Therefore, the proposition follows by noting that $T/(T - c) \to 1$ as $T \to \infty$. ■

The proposition covers both estimators $\hat{\Sigma}_u$ and $\tilde{\Sigma}_u$. It implies that the feasible estimators $\hat{\Sigma}_u$ and $\tilde{\Sigma}_u$ have the same asymptotic properties as the estimator

$$\frac{UU'}{T} = \frac{1}{T}\sum_{t=1}^{T} u_t u_t'$$

which is based on the unknown true residuals and is therefore not feasible in practice. In particular, if $\sqrt{T}\operatorname{vec}(UU'/T - \Sigma_u)$ converges in distribution, $\sqrt{T}\operatorname{vec}(\hat{\Sigma}_u - \Sigma_u)$ and $\sqrt{T}\operatorname{vec}(\tilde{\Sigma}_u - \Sigma_u)$ will have the same limiting distribution (see Proposition C.2 of Appendix C.1). Moreover, it can be shown that the asymptotic distributions are independent of the limiting distribution of the LS estimator $\hat{B}$. Another immediate implication of Proposition 3.2 is that $\hat{\Sigma}_u$ and $\tilde{\Sigma}_u$ are consistent estimators of $\Sigma_u$. This result is established next.

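Corollary 3.2.1

Under the conditions of Proposition 3.2,

$$\operatorname{plim}\hat{\Sigma}_u = \operatorname{plim}\tilde{\Sigma}_u = \operatorname{plim}UU'/T = \Sigma_u.$$

Proof: By Proposition 3.2, it suffices to show that $\operatorname{plim}UU'/T = \Sigma_u$ which follows from Proposition C.12(4) because

$$E\left(\frac{1}{T}UU'\right) = \frac{1}{T}\sum_{t=1}^{T}E(u_t u_t') = \Sigma_u$$

and

$$\operatorname{Var}\left(\frac{1}{T}\operatorname{vec}(UU')\right) = \frac{1}{T^2}\sum_{t=1}^{T}\operatorname{Var}[\operatorname{vec}(u_t u_t')] \le \frac{g}{T} \underset{T \to \infty}{\longrightarrow} 0,$$

where $g$ is a constant upper bound for $\operatorname{Var}[\operatorname{vec}(u_t u_t')]$. This bound exists because the fourth moments of $u_t$ are bounded by Definition 3.1. ■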

If $y_t$ is stable with standard white noise, Proposition 3.1 and Corollary 3.2.1 imply that $(\hat{\beta}_i - \beta_i)/\hat{s}_i$ has an asymptotic standard normal distribution. Here $\beta_i$ ($\hat{\beta}_i$) is the $i$-th component of $\beta$ ($\hat{\beta}$) and $\hat{s}_i$ is the square root of the $i$-th diagonal element of

$$(ZZ')^{-1} \otimes \hat{\Sigma}_u. \tag{3.2.21}$$

This result means that we can use the "t-ratios" provided by common regression programs in setting up confidence intervals and tests for individual coefficients. The critical values and percentiles may be based on the asymptotic standard normal distribution. Because it was found in simulation studies that the small sample distributions of the "t-ratios" have fatter tails than the standard normal distribution, one may want to approximate the small sample distribution by some t-distribution. The question is then what number of degrees of freedom (d.f.) should be used. The overall model (3.2.3) may suggest a choice of d.f. $= KT - K^2p - K$ because in a standard regression model with nonstochastic regressors the d.f. of the "t-ratios" are equal to the sample size minus the number of estimated parameters. In the present case, it seems also reasonable to use d.f. $= T - Kp - 1$ because the multivariate LS estimator is identical to the LS estimator obtained for each of the $K$ equations in (3.2.2) separately. In a separate regression for each individual equation, we would have $T$ observations and $Kp + 1$ parameters. If the sample size $T$ is large and, thus, the number of degrees of freedom is large, the corresponding t-distribution will be very close to the standard normal so that the choice between the two becomes irrelevant for large samples. Before we look a little further into the problem of choosing appropriate critical values, let us illustrate the foregoing results by an example.


3.2.3 An Example

As an example, we consider a three-dimensional system consisting of first differences of the logarithms of quarterly, seasonally adjusted West German fixed investment ($y_1$), disposable income ($y_2$), and consumption expenditures ($y_3$) from File E1 of the data sets associated with this book. We use only data from 1960-1978 and reserve the data for 1979-1982 for a subsequent analysis. The original data and first differences of logarithms are plotted in Figures 3.1 and 3.2, respectively. The original data have a trend and are thus considered to be nonstationary. The trend is removed by taking first differences of logarithms. We will discuss this issue in some more detail in Part II. Note that the value for 1960.1 is lost in the differenced series.

Fig. 3.1. West German investment, income, and consumption data.

Let us assume that the data have been generated by a VAR(2) process. The choice of the VAR order $p = 2$ is arbitrary at this point. In the next chapter, criteria for choosing the VAR order will be considered. Because the VAR order is two, we keep the first two observations of the differenced series as presample values and use a sample size of $T = 73$. Thus, we have a $(3 \times 73)$ matrix $Y$, $B = (\nu, A_1, A_2)$ is $(3 \times 7)$, $Z$ is $(7 \times 73)$ and $\beta$ and $\mathbf{b}$ are both $(21 \times 1)$ vectors.

The LS estimates are

$$\hat{B} = (\hat{\nu}, \hat{A}_1, \hat{A}_2) = YZ'(ZZ')^{-1} = \begin{bmatrix} -.017 & -.320 & .146 & .961 & -.161 & .115 & .934 \\ .016 & .044 & -.153 & .289 & .050 & .019 & -.010 \\ .013 & -.002 & .225 & -.264 & .034 & .355 & -.022 \end{bmatrix}. \tag{3.2.22}$$

Fig. 3.2. First differences of logarithms of West German investment, income, and consumption.

To check the stability of the estimated process, we determine the roots of the polynomial $\det(I_3 - \hat{A}_1 z - \hat{A}_2 z^2)$ which is easily seen to have degree 6. Its roots are

$$z_1 = 1.753, \quad z_2 = -2.694, \quad z_{3/4} = -0.320 \pm 2.008i, \quad z_{5/6} = -1.285 \pm 1.280i.$$

Note that these roots have been computed using higher precision than the three digits in (3.2.22). They all have modulus greater than 1 and, hence, the stability condition is satisfied.
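The stability check can be sketched numerically. The roots of the determinantal polynomial are the reciprocals of the nonzero eigenvalues of the companion matrix of the VAR, so one eigenvalue computation suffices. The helper name `var_roots` is hypothetical, and the coefficient matrices below are the three-digit values from (3.2.22), so the computed roots can be expected to agree with the ones quoted in the text only approximately.

```python
import numpy as np

def var_roots(A_list):
    """Roots of det(I_K - A_1 z - ... - A_p z^p) via the companion matrix."""
    K = A_list[0].shape[0]
    p = len(A_list)
    companion = np.zeros((K * p, K * p))
    companion[:K, :] = np.hstack(A_list)           # first block row: A_1 ... A_p
    companion[K:, :-K] = np.eye(K * (p - 1))       # identity blocks below it
    eig = np.linalg.eigvals(companion)
    return 1.0 / eig[np.abs(eig) > 1e-12]          # reciprocals of nonzero eigenvalues

# Rounded LS estimates taken from (3.2.22)
A1_hat = np.array([[-0.320, 0.146, 0.961],
                   [0.044, -0.153, 0.289],
                   [-0.002, 0.225, -0.264]])
A2_hat = np.array([[-0.161, 0.115, 0.934],
                   [0.050, 0.019, -0.010],
                   [0.034, 0.355, -0.022]])

roots = var_roots([A1_hat, A2_hat])
is_stable = bool(np.all(np.abs(roots) > 1.0))      # stability: all roots outside unit circle
```

Equivalently, the process is stable if all eigenvalues of the companion matrix have modulus less than one.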

We get

$$\hat{\Sigma}_u = \frac{1}{T - Kp - 1}\left(YY' - YZ'(ZZ')^{-1}ZY'\right) = \begin{bmatrix} 21.30 & .72 & 1.23 \\ .72 & 1.37 & .61 \\ 1.23 & .61 & .89 \end{bmatrix} \times 10^{-4} \tag{3.2.23}$$

as estimate of the residual covariance matrix $\Sigma_u$. Furthermore,

$$\hat{\Gamma}^{-1} = (ZZ'/T)^{-1} = T \begin{bmatrix} .14 & .17 & -.69 & -2.51 & .10 & -.67 & -2.57 \\ \cdot & 7.39 & 1.24 & -10.56 & 1.80 & 1.08 & -8.70 \\ \cdot & \cdot & 139.81 & -87.40 & -4.58 & 30.21 & -50.88 \\ \cdot & \cdot & \cdot & 207.22 & .84 & -55.35 & 73.82 \\ \cdot & \cdot & \cdot & \cdot & 7.33 & -.03 & -9.31 \\ \cdot & \cdot & \cdot & \cdot & \cdot & 134.19 & -82.64 \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 207.71 \end{bmatrix}.$$

Dividing the elements of $\hat{B}$ by square roots of the corresponding diagonal elements of $(ZZ')^{-1} \otimes \hat{\Sigma}_u$ we get the matrix of t-ratios:

$$\begin{bmatrix} -0.97 & -2.55 & 0.27 & 1.45 & -1.29 & 0.21 & 1.41 \\ 3.60 & 1.38 & -1.10 & 1.71 & 1.58 & 0.14 & -0.06 \\ 3.67 & -0.09 & 2.01 & -1.94 & 1.33 & 3.24 & -0.16 \end{bmatrix}. \tag{3.2.24}$$

We may compare these quantities with critical values from a t-distribution with d.f. $= KT - K^2p - K = 198$ or d.f. $= T - Kp - 1 = 66$. In both cases, we get critical values of approximately $\pm 2$ for a two-tailed test with significance level 5%. Thus, the critical values are approximately the same as those from a standard normal distribution.
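As a rough numerical check of how the t-ratios arise, the sketch below recomputes them from the rounded figures printed in (3.2.22), (3.2.23) and the displayed matrix $(ZZ')^{-1}$. Because the Kronecker product $(ZZ')^{-1} \otimes \hat{\Sigma}_u$ follows the column-stacking order of vec, the variance attached to the coefficient in row k and column j of the coefficient matrix is the j-th diagonal element of $(ZZ')^{-1}$ times the k-th diagonal element of $\hat{\Sigma}_u$. The variable names are chosen here for illustration, and small deviations from (3.2.24) are to be expected because only rounded inputs are used.

```python
import numpy as np

# Rounded LS estimates from (3.2.22)
B_hat = np.array([
    [-0.017, -0.320, 0.146, 0.961, -0.161, 0.115, 0.934],
    [0.016, 0.044, -0.153, 0.289, 0.050, 0.019, -0.010],
    [0.013, -0.002, 0.225, -0.264, 0.034, 0.355, -0.022],
])

# Diagonal of Sigma_u_hat from (3.2.23) and diagonal of (ZZ')^{-1}
sigma_diag = 1e-4 * np.array([21.30, 1.37, 0.89])
zz_inv_diag = np.array([0.14, 7.39, 139.81, 207.22, 7.33, 134.19, 207.71])

# Standard error of B_hat[k, j] is sqrt(sigma_diag[k] * zz_inv_diag[j])
std_err = np.sqrt(np.outer(sigma_diag, zz_inv_diag))
t_ratios = B_hat / std_err
```

Entries of `t_ratios` with absolute value above roughly 2 correspond to coefficients that are significant at the 5% level under either choice of degrees of freedom.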

Apparently  quite  a  few  coefficients  are  not  significant  under  this  criterion. 
This  observation  suggests  that  the  model  contains  unnecessarily  many  free 
parameters.  In  subsequent  chapters,  we  will  discuss  the  problem  of  choosing 
the  VAR  order  and  possible  restrictions  for  the  coefficients.  Also,  before  an 
estimated  model  is  used  for  forecasting  and  analysis  purposes,  the  assump¬ 
tions  underlying  the  analysis  should  be  checked  carefully.  Checking  the  model 
adequacy  will  be  treated  in  greater  detail  in  Chapter  4. 

3.2.4  Small  Sample  Properties  of  the  LS  Estimator 

As  mentioned  earlier,  it  is  difficult  to  derive  small  sample  properties  of  the 
LS estimator analytically. In such a case it is sometimes helpful to use Monte
Carlo methods to get some idea about the small sample properties. In a Monte
Carlo  analysis,  specific  processes  are  used  to  artificially  generate  a  large  num¬ 
ber  of  time  series.  Then  a  set  of  estimates  is  computed  for  each  multiple  time 
series  generated  and  the  properties  of  the  resulting  empirical  distributions  of 
these  estimates  are  studied  (see  Appendix  D).  Such  an  approach  usually  per¬ 
mits  rather  limited  conclusions  only  because  the  findings  may  depend  on  the 
particular  processes  used  for  generating  the  time  series.  Nevertheless,  such 
exercises  give  some  insight  into  the  small  sample  properties  of  estimators. 

In the following, we use the bivariate VAR(2) example process (2.1.15),

$$y_t = \begin{bmatrix} .02 \\ .03 \end{bmatrix} + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} y_{t-1} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} y_{t-2} + u_t \tag{3.2.25}$$

with error covariance matrix

$$\Sigma_u = \begin{bmatrix} 9 & 0 \\ 0 & 4 \end{bmatrix} \times 10^{-4} \tag{3.2.26}$$

to investigate the small sample properties of the multivariate LS estimator. With this process we have generated 1000 bivariate time series of length $T = 30$ plus 2 presample values using independent normally distributed errors, that is, $u_t \sim \mathcal{N}(0, \Sigma_u)$. Thus the 1000 bivariate time series are generated by a stable Gaussian process so that Propositions 3.1 and 3.2 provide the asymptotic properties of the LS estimators.

In  Table  3.1,  some  empirical  results  are  given.  In  particular,  the  empirical 
mean,  variance,  and  mean  squared  error  (MSE)  of  each  parameter  estimator 
are  given.  Obviously,  the  empirical  means  differ  from  the  actual  values  of 
the  coefficients.  However,  measuring  the  estimation  precision  by  the  empirical 
variance  (average  squared  deviation  from  the  mean  in  1000  samples)  or  MSE 
(average  squared  deviation  from  the  true  parameter  value),  the  coefficients 
are  seen  to  be  estimated  quite  precisely  even  with  a  sample  size  as  small  as 
T  =  30.  This  is  partly  a  consequence  of  the  special  properties  of  the  process. 
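A Monte Carlo experiment of this kind can be sketched in a few lines. The code below is a simplified illustration and not the program used for Table 3.1: the burn-in length, the zero starting values, the random number generator and its seed are all assumptions made here, so the resulting numbers will differ somewhat from those in the table.

```python
import numpy as np

# Example process (3.2.25) with error covariance (3.2.26)
NU = np.array([0.02, 0.03])
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
SIGMA_U = 1e-4 * np.array([[9.0, 0.0], [0.0, 4.0]])
B_TRUE = np.hstack([NU[:, None], A1, A2])          # (2 x 5) matrix (nu, A1, A2)

def simulate(T, rng, burn=50):
    """Generate T observations plus 2 presample values (burn-in is an assumption)."""
    chol = np.linalg.cholesky(SIGMA_U)
    y = np.zeros((burn + T + 2, 2))
    for t in range(2, len(y)):
        u = chol @ rng.standard_normal(2)          # u_t ~ N(0, SIGMA_U)
        y[t] = NU + A1 @ y[t - 1] + A2 @ y[t - 2] + u
    return y[burn:]

def ls_estimate(y, p=2):
    """B_hat = Y Z'(Z Z')^{-1} for a VAR(p) with intercept."""
    n = y.shape[0]
    T = n - p
    Y = y[p:].T
    Z = np.vstack([np.ones((1, T))] + [y[p - i:n - i].T for i in range(1, p + 1)])
    return Y @ Z.T @ np.linalg.inv(Z @ Z.T)

def monte_carlo(reps=500, T=30, seed=0):
    """Empirical mean, variance and MSE of the LS estimates over reps samples."""
    rng = np.random.default_rng(seed)
    est = np.array([ls_estimate(simulate(T, rng)) for _ in range(reps)])
    mean = est.mean(axis=0)
    var = est.var(axis=0)                          # squared deviation from the mean
    mse = ((est - B_TRUE) ** 2).mean(axis=0)       # squared deviation from the truth
    return mean, var, mse
```

By construction the MSE equals the empirical variance plus the squared empirical bias, which is why the two precision measures are close whenever the bias is small.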

In Table 3.1, empirical percentiles of the t-ratios are also given together
with the corresponding percentiles from the t- and standard normal distributions
(d.f. = ∞). Even with the presently considered relatively small sample
size  the  percentiles  of  the  three  distributions  that  might  be  used  for  inference 
do  not  differ  much.  Consequently,  it  does  not  matter  much  which  of  the  the¬ 
oretical  percentiles  are  used,  in  particular,  because  the  empirical  percentiles, 
in  many  cases,  differ  quite  a  bit  from  the  corresponding  theoretical  quantities. 
This  example  shows  that  the  asymptotic  results  have  to  be  used  cautiously 
in  setting  up  small  sample  tests  and  confidence  intervals.  On  the  other  hand, 
this  example  also  demonstrates  that  the  asymptotic  theory  does  provide  some 
guidance  for  inference.  For  example,  the  empirical  95th  percentiles  of  all  co¬ 
efficients  lie  between  the  90th  and  the  99th  percentile  of  the  standard  normal 
distribution given in the last row of the table. Of course, this is just one
example and not a general finding.

Table 3.1. Empirical percentiles of t-ratios of parameter estimates for the example process and actual percentiles of t-distributions for sample size T = 30

                        empirical                    empirical percentiles of t-ratios
parameter           mean   variance    MSE       1.      5.     10.     50.     90.     95.     99.
ν_1 = .02           .041    .0011    .0015    -1.91   -1.04   -0.64    0.62    1.92    2.29    3.12
ν_2 = .03           .038    .0005    .0006    -2.30   -1.40   -1.02    0.25    1.65    2.11    2.83
α_11,1 = .5         .41     .041     .049     -2.78   -2.18   -1.74   -0.43    0.92    1.28    2.01
α_21,1 = .4         .40     .018     .018     -2.61   -1.74   -1.28    0.04    1.28    1.71    2.65
α_12,1 = .1         .10     .078     .078     -2.27   -1.67   -1.35   -0.03    1.29    1.67    2.38
α_22,1 = .5         .44     .030     .034     -2.69   -1.97   -1.59   -0.35    0.89    1.30    2.06
α_11,2 = 0         -.05     .056     .058     -2.75   -1.93   -1.50   -0.24    1.02    1.38    2.09
α_21,2 = .25        .29     .023     .024     -1.99   -1.32   -0.99    0.20    1.45    1.81    2.48
α_12,2 = 0         -.07     .053     .058     -2.48   -1.91   -1.61   -0.28    0.97    1.39    2.03
α_22,2 = 0         -.01     .023     .024     -2.71   -1.72   -1.36   -0.03    1.18    1.53    2.18

                                                     percentiles of t-distributions
degrees of freedom (d.f.)                        1.      5.     10.     50.     90.     95.     99.
T - Kp - 1 = 25                               -2.49   -1.71   -1.32    0       1.32    1.71    2.49
K(T - Kp - 1) = 50                            -2.41   -1.68   -1.30    0       1.30    1.68    2.41
∞ (normal distribution)                       -2.33   -1.65   -1.28    0       1.28    1.65    2.33

In an extensive study, Nankervis & Savin (1988) investigated the small sample distribution of the "t-statistic" for the parameter of a univariate AR(1) process. They found that it differs quite substantially from the corresponding t-distribution, especially if the sample size is small ($T < 100$) and the parameter lies close to the instability region. Analytical results on the bias in estimating VAR models were derived by Nicholls & Pope (1988) and Tjøstheim & Paulsen (1983). What should be learned from our Monte Carlo investigation and these remarks is that asymptotic distributions in the present context can only be used as rough guidelines for small sample inference. That, however, is much better than having no guidance at all.


3.3 Least Squares Estimation with Mean-Adjusted Data and Yule-Walker Estimation

3.3.1  Estimation  when  the  Process  Mean  Is  Known 

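Occasionally a VAR(p) model is given in mean-adjusted form,

$$(y_t - \mu) = A_1(y_{t-1} - \mu) + \cdots + A_p(y_{t-p} - \mu) + u_t. \tag{3.3.1}$$

Multivariate LS estimation of this model form is straightforward if the mean vector $\mu$ is known. Defining

$$\begin{aligned}
Y^0 &:= (y_1 - \mu, \ldots, y_T - \mu) && (K \times T), \\
A &:= (A_1, \ldots, A_p) && (K \times Kp), \\
Y_t^0 &:= \begin{bmatrix} y_t - \mu \\ \vdots \\ y_{t-p+1} - \mu \end{bmatrix} && (Kp \times 1), \\
X &:= (Y_0^0, \ldots, Y_{T-1}^0) && (Kp \times T), \\
\mathbf{y}^0 &:= \operatorname{vec}(Y^0) && (KT \times 1), \\
\alpha &:= \operatorname{vec}(A) && (K^2p \times 1),
\end{aligned} \tag{3.3.2}$$

we can write (3.3.1), for $t = 1, \ldots, T$, compactly as

$$Y^0 = AX + U \tag{3.3.3}$$

or

$$\mathbf{y}^0 = (X' \otimes I_K)\alpha + \mathbf{u}, \tag{3.3.4}$$

where $U$ and $\mathbf{u}$ are defined as in (3.2.1). The LS estimator is easily seen to be

$$\hat{\alpha} = ((XX')^{-1}X \otimes I_K)\mathbf{y}^0 \tag{3.3.5}$$

or

$$\hat{A} = Y^0X'(XX')^{-1}. \tag{3.3.6}$$

If $y_t$ is stable and $u_t$ is standard white noise, it can be shown that

$$\sqrt{T}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\alpha}}), \tag{3.3.7}$$

where

$$\Sigma_{\hat{\alpha}} = \Gamma_Y(0)^{-1} \otimes \Sigma_u \tag{3.3.8}$$

and $\Gamma_Y(0) := E(Y_t^0 Y_t^{0\prime})$.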


3.3.2  Estimation  of  the  Process  Mean 

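Usually $\mu$ will not be known in advance. In that case, it may be estimated by the vector of sample means,

$$\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t. \tag{3.3.9}$$

Using (3.3.1), $\bar{y}$ can be written as

$$\begin{aligned}
\bar{y} = \mu &+ A_1\left[\bar{y} + \frac{1}{T}(y_0 - y_T) - \mu\right] + \cdots \\
&+ A_p\left[\bar{y} + \frac{1}{T}(y_{-p+1} + \cdots + y_0 - y_{T-p+1} - \cdots - y_T) - \mu\right] + \frac{1}{T}\sum_{t=1}^{T} u_t.
\end{aligned}$$

Hence,

$$(I_K - A_1 - \cdots - A_p)(\bar{y} - \mu) = \frac{1}{T}z_T + \frac{1}{T}\sum_{t=1}^{T} u_t, \tag{3.3.10}$$

where

$$z_T = \sum_{i=1}^{p} A_i\left[\sum_{j=0}^{i-1}(y_{0-j} - y_{T-j})\right].$$

Evidently,

$$E(z_T/\sqrt{T}) = \frac{1}{\sqrt{T}}E(z_T) = 0$$

and

$$\operatorname{Var}(z_T/\sqrt{T}) = \frac{1}{T}\operatorname{Var}(z_T) \underset{T \to \infty}{\longrightarrow} 0$$

because $y_t$ is stable. In other words, $z_T/\sqrt{T}$ converges to zero in mean square. It follows that $\sqrt{T}(I_K - A_1 - \cdots - A_p)(\bar{y} - \mu)$ has the same asymptotic distribution as $\sum_t u_t/\sqrt{T}$ (see Appendix C, Proposition C.2). Hence, noting that, by a central limit theorem (e.g., Fuller (1976) or Proposition C.13),

$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T} u_t \xrightarrow{d} \mathcal{N}(0, \Sigma_u), \tag{3.3.11}$$

if $u_t$ is standard white noise, we get the following result: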

Proposition 3.3 (Asymptotic Properties of the Sample Mean)

If the VAR(p) process $y_t$ given in (3.3.1) is stable and $u_t$ is standard white noise, then

$$\sqrt{T}(\bar{y} - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\bar{y}}), \tag{3.3.12}$$

where

$$\Sigma_{\bar{y}} = (I_K - A_1 - \cdots - A_p)^{-1}\Sigma_u(I_K - A_1 - \cdots - A_p)'^{-1}.$$

In particular, $\operatorname{plim}\bar{y} = \mu$. ■

The proposition follows from (3.3.10), (3.3.11), and Proposition C.15 of Appendix C. The limiting distribution in (3.3.11) holds even in small samples for Gaussian white noise $u_t$.

Because $\mu = (I_K - A_1 - \cdots - A_p)^{-1}\nu$ (see Chapter 2, Section 2.1), an alternative estimator for the process mean is obtained from the LS estimator of the previous section:

$$\hat{\mu} = (I_K - \hat{A}_1 - \cdots - \hat{A}_p)^{-1}\hat{\nu}. \tag{3.3.13}$$

Using again Proposition C.15 of Appendix C, this estimator is also consistent and has an asymptotic normal distribution,

$$\sqrt{T}(\hat{\mu} - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\mu}}), \tag{3.3.14}$$

provided the conditions of Proposition 3.1 are satisfied. It can be shown that

$$\Sigma_{\hat{\mu}} = \Sigma_{\bar{y}} \tag{3.3.15}$$

and, hence, the estimators $\hat{\mu}$ and $\bar{y}$ for $\mu$ are asymptotically equivalent (see Section 3.4). This result suggests that it does not matter asymptotically whether the mean is estimated separately or jointly with the other VAR coefficients. While this holds asymptotically, it will usually matter in small samples which estimator is used. An example will be given shortly.
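To make the quantities in Proposition 3.3 concrete, the following sketch evaluates the process mean and the asymptotic covariance matrix of the sample mean for the bivariate example process (3.2.25) with error covariance (3.2.26). It is only an illustration with known coefficients; the variable names are chosen here and do not appear in the text.

```python
import numpy as np

# Coefficients of the example process (3.2.25) and covariance (3.2.26)
nu = np.array([0.02, 0.03])
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
Sigma_u = 1e-4 * np.array([[9.0, 0.0], [0.0, 4.0]])

M = np.eye(2) - A1 - A2                 # I_K - A_1 - A_2
M_inv = np.linalg.inv(M)

mu = M_inv @ nu                         # process mean
Sigma_ybar = M_inv @ Sigma_u @ M_inv.T  # asymptotic covariance of sqrt(T)(ybar - mu)

# Approximate standard errors of the sample mean for a sample of size T = 30
std_err_T30 = np.sqrt(np.diag(Sigma_ybar) / 30)
```

The division by T in the last line reflects the scaling by the square root of T in (3.3.12); the result is an asymptotic approximation and may be inaccurate in small samples.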


3.3.3  Estimation  with  Unknown  Process  Mean 

If the mean vector $\mu$ is unknown, it may be replaced by $\bar{y}$ in the vectors and matrices in (3.3.2) giving $\hat{X}$, $\hat{Y}^0$ and so on. The resulting LS estimator,

$$\hat{\hat{\alpha}} = ((\hat{X}\hat{X}')^{-1}\hat{X} \otimes I_K)\hat{\mathbf{y}}^0,$$

is asymptotically equivalent to $\hat{\alpha}$. More precisely, it can be shown that, under the conditions of Proposition 3.3,

$$\sqrt{T}(\hat{\hat{\alpha}} - \alpha) \xrightarrow{d} \mathcal{N}(0, \Gamma_Y(0)^{-1} \otimes \Sigma_u), \tag{3.3.16}$$

where $\Gamma_Y(0) := E(Y_t^0 Y_t^{0\prime})$. This result will be discussed further in the next section on maximum likelihood estimation for Gaussian processes.


3.3.4 The Yule-Walker Estimator

The LS estimator can also be derived from the Yule-Walker equations given in Chapter 2, (2.1.37). They imply

$$\Gamma_y(h) = [A_1, \ldots, A_p]\begin{bmatrix} \Gamma_y(h-1) \\ \vdots \\ \Gamma_y(h-p) \end{bmatrix}, \quad h > 0,$$

or

$$[\Gamma_y(1), \ldots, \Gamma_y(p)] = [A_1, \ldots, A_p]\begin{bmatrix} \Gamma_y(0) & \cdots & \Gamma_y(p-1) \\ \vdots & \ddots & \vdots \\ \Gamma_y(-p+1) & \cdots & \Gamma_y(0) \end{bmatrix} = A\,\Gamma_Y(0) \tag{3.3.17}$$

and, hence,

$$A = [\Gamma_y(1), \ldots, \Gamma_y(p)]\,\Gamma_Y(0)^{-1}.$$

Estimating $\Gamma_Y(0)$ by $\hat{X}\hat{X}'/T$ and $[\Gamma_y(1), \ldots, \Gamma_y(p)]$ by $\hat{Y}^0\hat{X}'/T$, the resulting estimator is just the LS estimator,

$$\hat{\hat{A}} = \hat{Y}^0\hat{X}'(\hat{X}\hat{X}')^{-1}. \tag{3.3.18}$$

Alternatively, the moment matrices $\Gamma_y(h)$ may be estimated using as many data as are available, including the presample values. Thus, if a sample $y_1, \ldots, y_T$ and $p$ presample observations $y_{-p+1}, \ldots, y_0$ are available, $\mu$ may be estimated as

$$\bar{y}^* = \frac{1}{T+p}\sum_{t=-p+1}^{T} y_t$$

and $\Gamma_y(h)$ may be estimated as

$$\hat{\Gamma}_y(h) = \frac{1}{T+p-h}\sum_{t=-p+h+1}^{T}(y_t - \bar{y}^*)(y_{t-h} - \bar{y}^*)'. \tag{3.3.19}$$

Using these estimators in (3.3.17), the so-called Yule-Walker estimator for $A$ is obtained. For stable processes, this estimator has the same asymptotic properties as the LS estimator. However, it may have less attractive small sample properties (e.g., Tjøstheim & Paulsen (1983)).
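The construction of the Yule-Walker estimator can be sketched as follows. The function name `yule_walker` and the data layout (rows are time periods, presample values included) are assumptions made for this illustration. The divisor used for each autocovariance matrix is the number of available cross products, which is how (3.3.19) is read here; other conventions exist.

```python
import numpy as np

def yule_walker(y, p):
    """Yule-Walker estimate of A = (A_1, ..., A_p) (illustrative sketch).

    y has shape (T + p, K) and contains the presample values as well.
    """
    n, K = y.shape
    d = y - y.mean(axis=0)               # deviations from the mean of all T + p values

    def gamma(h):
        # Autocovariance estimate at lag h; Gamma(-h) = Gamma(h)'
        if h < 0:
            return gamma(-h).T
        return d[h:].T @ d[:n - h] / (n - h)

    lhs = np.hstack([gamma(h) for h in range(1, p + 1)])            # [G(1), ..., G(p)]
    big = np.block([[gamma(j - i) for j in range(p)] for i in range(p)])
    return lhs @ np.linalg.inv(big)      # A = [G(1), ..., G(p)] Gamma_Y(0)^{-1}
```

For long series the result should be close to the LS estimate based on mean-adjusted data, in line with the asymptotic equivalence stated above.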

The Yule-Walker estimator always produces estimates in the stability region (see Brockwell & Davis (1987, §8.1) for a discussion of the univariate case). In other words, the estimated process is always stable. This property is sometimes regarded as an advantage of the Yule-Walker estimator. It is responsible for possibly considerable bias of the estimator, however. Also, in practice, it may not be known a priori whether the data generation process of a given multiple time series is stable. In the unstable case, LS and Yule-Walker estimation are not asymptotically equivalent anymore (see also the discussion in Reinsel (1993, Section 4.4)). Therefore, enforcing stability may not be a good strategy in practice. The LS estimator is usually used in the following.

3.3.5 An Example

To illustrate the results of this section, we use again the West German investment, income, and consumption data. The variables $y_1$, $y_2$, and $y_3$ are defined as in Section 3.2.3, the sample period ranges from 1960.4 to 1978.4, that is, $T = 73$ and the data for 1960.2 and 1960.3 are used as presample values. Using only the sample values we get

$$\bar{y} = \begin{bmatrix} .018 \\ .020 \\ .020 \end{bmatrix} \tag{3.3.20}$$

which is different, though not substantially so, from

$$\hat{\mu} = (I_3 - \hat{A}_1 - \hat{A}_2)^{-1}\hat{\nu} = \begin{bmatrix} .017 \\ .020 \\ .020 \end{bmatrix} \tag{3.3.21}$$

as obtained from the LS estimates in (3.2.22).
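The value in (3.3.21) can be reproduced approximately from the rounded LS estimates printed in (3.2.22). The sketch below solves the linear system instead of forming the inverse explicitly; because only three-digit inputs are used, agreement to about three decimals is all that should be expected.

```python
import numpy as np

# Rounded LS estimates (nu_hat, A1_hat, A2_hat) taken from (3.2.22)
nu_hat = np.array([-0.017, 0.016, 0.013])
A1_hat = np.array([[-0.320, 0.146, 0.961],
                   [0.044, -0.153, 0.289],
                   [-0.002, 0.225, -0.264]])
A2_hat = np.array([[-0.161, 0.115, 0.934],
                   [0.050, 0.019, -0.010],
                   [0.034, 0.355, -0.022]])

# mu_hat = (I_3 - A1_hat - A2_hat)^{-1} nu_hat, computed via a linear solve
mu_hat = np.linalg.solve(np.eye(3) - A1_hat - A2_hat, nu_hat)
```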

Subtracting the sample means from the data we get, based on (3.3.18),

$$\hat{\hat{A}} = (\hat{\hat{A}}_1, \hat{\hat{A}}_2) = \begin{bmatrix} -.319 & .143 & .960 & -.160 & .112 & .933 \\ .044 & -.153 & .288 & .050 & .019 & -.010 \\ -.002 & .224 & -.264 & .034 & .354 & -.023 \end{bmatrix}. \tag{3.3.22}$$

This estimate is clearly distinct from the corresponding part of (3.2.22), although the two estimates do not differ dramatically.

If the two presample values are used in estimating the process means and moment matrices we get

$$\hat{A}_{YW} = \begin{bmatrix} -.319 & .147 & .959 & -.160 & .115 & .932 \\ .044 & -.152 & .286 & .050 & .020 & -.012 \\ -.002 & .225 & -.264 & .034 & .355 & -.022 \end{bmatrix} \tag{3.3.23}$$

which is the Yule-Walker estimate. Although the sample size is moderate, there is a slight difference between the estimates in (3.3.22) and (3.3.23).

3.4  Maximum  Likelihood  Estimation 

3.4.1  The  Likelihood  Function 

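Assuming that the distribution of the process is known, maximum likelihood (ML) estimation is an alternative to LS estimation. We will consider ML estimation under the assumption that the VAR(p) process $y_t$ is Gaussian. More precisely,

$$\mathbf{u} = \operatorname{vec}(U) = \begin{bmatrix} u_1 \\ \vdots \\ u_T \end{bmatrix} \sim \mathcal{N}(0, I_T \otimes \Sigma_u). \tag{3.4.1}$$

In other words, the probability density of $\mathbf{u}$ is

$$f_{\mathbf{u}}(\mathbf{u}) = \frac{1}{(2\pi)^{KT/2}}\,|I_T \otimes \Sigma_u|^{-1/2}\exp\left[-\frac{1}{2}\mathbf{u}'(I_T \otimes \Sigma_u^{-1})\mathbf{u}\right]. \tag{3.4.2}$$

Moreover,

$$\mathbf{u} = \begin{bmatrix}
I_K & 0 & \cdots & \cdots & \cdots & 0 \\
-A_1 & I_K & 0 & \cdots & \cdots & 0 \\
\vdots & \ddots & \ddots & & & \vdots \\
-A_p & -A_{p-1} & \cdots & I_K & \cdots & 0 \\
\vdots & \ddots & & & \ddots & \vdots \\
0 & \cdots & -A_p & \cdots & -A_1 & I_K
\end{bmatrix}(\mathbf{y} - \boldsymbol{\mu}^*) + \begin{bmatrix}
-A_1 & -A_2 & \cdots & -A_p \\
-A_2 & -A_3 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
-A_p & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}(Y_0 - \boldsymbol{\mu}), \tag{3.4.3}$$

where $\mathbf{y} := \operatorname{vec}(Y)$ and $\boldsymbol{\mu}^* := (\mu', \ldots, \mu')'$ are $(TK \times 1)$ vectors and $Y_0 := (y_0', \ldots, y_{-p+1}')'$ and $\boldsymbol{\mu} := (\mu', \ldots, \mu')'$ are $(Kp \times 1)$. Consequently, $\partial\mathbf{u}/\partial\mathbf{y}'$ is a lower triangular matrix with unit diagonal which has unit determinant. Hence, using that $\mathbf{u} = \mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha$,

$$\begin{aligned}
f_{\mathbf{y}}(\mathbf{y}) &= \left|\frac{\partial\mathbf{u}}{\partial\mathbf{y}'}\right| f_{\mathbf{u}}(\mathbf{u}) \\
&= \frac{1}{(2\pi)^{KT/2}}\,|I_T \otimes \Sigma_u|^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha)'(I_T \otimes \Sigma_u^{-1})(\mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha)\right],
\end{aligned} \tag{3.4.4}$$

where $X$ and $\alpha$ are as defined in (3.3.2). For simplicity, the initial values $Y_0$ are assumed to be given fixed numbers. Hence, we get a log-likelihood function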


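$$\begin{aligned}
\ln l(\mu, \alpha, \Sigma_u) &= -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\left[\mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha\right]'(I_T \otimes \Sigma_u^{-1})\left[\mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha\right] \\
&= -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\sum_{t=1}^{T}\left[(y_t - \mu) - \sum_{i=1}^{p}A_i(y_{t-i} - \mu)\right]'\Sigma_u^{-1}\left[(y_t - \mu) - \sum_{i=1}^{p}A_i(y_{t-i} - \mu)\right] \\
&= -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\sum_{t}\left(y_t - \sum_{i}A_i y_{t-i}\right)'\Sigma_u^{-1}\left(y_t - \sum_{i}A_i y_{t-i}\right) \\
&\quad + \mu'\left(I_K - \sum_{i}A_i\right)'\Sigma_u^{-1}\sum_{t}\left(y_t - \sum_{i}A_i y_{t-i}\right) - \frac{T}{2}\mu'\left(I_K - \sum_{i}A_i\right)'\Sigma_u^{-1}\left(I_K - \sum_{i}A_i\right)\mu \\
&= -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\operatorname{tr}\left[(Y^0 - AX)'\Sigma_u^{-1}(Y^0 - AX)\right],
\end{aligned} \tag{3.4.5}$$

where $Y^0 := (y_1 - \mu, \ldots, y_T - \mu)$ and $A := (A_1, \ldots, A_p)$ are as defined in (3.3.2). These different expressions of the log-likelihood function will be useful in the following.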
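The equivalence of the different expressions of the log-likelihood can be checked numerically. The sketch below evaluates the trace form and the sum-over-periods form for arbitrary parameter values; the function names and the data layout (rows are time periods, the first p rows are the fixed initial values) are assumptions made for this illustration.

```python
import numpy as np

def loglik_trace(y, mu, A, Sigma_u, p):
    """Gaussian log-likelihood in the trace form of (3.4.5)."""
    n, K = y.shape
    T = n - p
    d = y - mu                                     # mean-adjusted data
    Y0 = d[p:].T                                   # (K x T)
    X = np.vstack([d[p - i:n - i].T for i in range(1, p + 1)])  # (Kp x T)
    U = Y0 - A @ X
    Sinv = np.linalg.inv(Sigma_u)
    _, logdet = np.linalg.slogdet(Sigma_u)
    quad = np.trace(U.T @ Sinv @ U)
    return -0.5 * K * T * np.log(2 * np.pi) - 0.5 * T * logdet - 0.5 * quad

def loglik_sum(y, mu, A, Sigma_u, p):
    """Same log-likelihood written as a sum over time periods."""
    n, K = y.shape
    Sinv = np.linalg.inv(Sigma_u)
    _, logdet = np.linalg.slogdet(Sigma_u)
    total = 0.0
    for t in range(p, n):
        lagged = np.concatenate([y[t - i] - mu for i in range(1, p + 1)])
        u = (y[t] - mu) - A @ lagged               # residual for period t
        total += -0.5 * (K * np.log(2 * np.pi) + logdet + u @ Sinv @ u)
    return total
```

The trace form avoids an explicit loop over time, which can be convenient when the function is evaluated many times in a numerical optimizer.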


3.4.2  The  ML  Estimators 

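In order to determine the ML estimators of $\mu$, $\alpha$, and $\Sigma_u$, the system of first order partial derivatives is needed:

$$\begin{aligned}
\frac{\partial \ln l}{\partial \mu} &= \left(I_K - \sum_{i}A_i\right)'\Sigma_u^{-1}\sum_{t}\left(y_t - \sum_{i}A_i y_{t-i}\right) - T\left(I_K - \sum_{i}A_i\right)'\Sigma_u^{-1}\left(I_K - \sum_{i}A_i\right)\mu \\
&= \left[I_K - A(j \otimes I_K)\right]'\Sigma_u^{-1}\sum_{t}\left(y_t - \mu - AY_{t-1}^0\right),
\end{aligned} \tag{3.4.6}$$

where $Y_t^0$ is as defined in (3.3.2) and $j := (1, \ldots, 1)'$ is a $(p \times 1)$ vector of ones,

$$\frac{\partial \ln l}{\partial \alpha} = (X \otimes I_K)(I_T \otimes \Sigma_u^{-1})\left[\mathbf{y} - \boldsymbol{\mu}^* - (X' \otimes I_K)\alpha\right] = (X \otimes \Sigma_u^{-1})(\mathbf{y} - \boldsymbol{\mu}^*) - (XX' \otimes \Sigma_u^{-1})\alpha, \tag{3.4.7}$$

$$\frac{\partial \ln l}{\partial \Sigma_u} = -\frac{T}{2}\Sigma_u^{-1} + \frac{1}{2}\Sigma_u^{-1}(Y^0 - AX)(Y^0 - AX)'\Sigma_u^{-1}. \tag{3.4.8}$$

Equating to zero gives the system of normal equations which can be solved for the estimators:

$$\tilde{\mu} = \frac{1}{T}\left(I_K - \sum_{i}\tilde{A}_i\right)^{-1}\sum_{t}\left(y_t - \sum_{i}\tilde{A}_i y_{t-i}\right), \tag{3.4.9}$$

$$\tilde{\alpha} = ((\tilde{X}\tilde{X}')^{-1}\tilde{X} \otimes I_K)(\mathbf{y} - \tilde{\boldsymbol{\mu}}^*), \tag{3.4.10}$$

$$\tilde{\Sigma}_u = \frac{1}{T}(\tilde{Y}^0 - \tilde{A}\tilde{X})(\tilde{Y}^0 - \tilde{A}\tilde{X})', \tag{3.4.11}$$

where $\tilde{X}$ and $\tilde{Y}^0$ are obtained from $X$ and $Y^0$, respectively, by replacing $\mu$ with $\tilde{\mu}$.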
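As a brief sketch of how the estimator of the mean arises, the first normal equation can be solved for $\mu$ with the $A_i$ treated as given. The step assumes that $I_K - A_1 - \cdots - A_p$ is nonsingular, which holds for stable processes because $z = 1$ is then not a root of the determinantal polynomial:

```latex
\begin{align*}
0 &= \Big(I_K - \sum_i A_i\Big)'\Sigma_u^{-1}
     \Big[\sum_t \Big(y_t - \sum_i A_i y_{t-i}\Big)
          - T\Big(I_K - \sum_i A_i\Big)\mu\Big] \\
\Longrightarrow\quad
\mu &= \frac{1}{T}\Big(I_K - \sum_i A_i\Big)^{-1}
       \sum_t \Big(y_t - \sum_i A_i y_{t-i}\Big).
\end{align*}
```

Replacing the $A_i$ by their estimators gives the expression for the ML estimator of the mean. Because that expression involves the coefficient estimates, and the coefficient estimates in turn involve the estimated mean, the normal equations form a simultaneous system rather than a set of separate closed-form formulas.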


3.4.3  Properties  of  the  ML  Estimators 

Comparing these results with the LS estimators obtained in Section 3.3, it turns out that the ML estimators of $\mu$ and $\alpha$ are identical to the LS estimators. Thus, $\tilde{\mu}$ and $\tilde{\alpha}$ are consistent estimators if $y_t$ is a stationary, stable Gaussian VAR(p) process and $\sqrt{T}(\tilde{\mu} - \mu)$ and $\sqrt{T}(\tilde{\alpha} - \alpha)$ are asymptotically normally distributed. This result also follows from a more general maximum likelihood theory (see Appendix C.6). In fact, that theory implies that the covariance matrix of the asymptotic distribution of the ML estimators is the limit of $T$ times the inverse information matrix. The information matrix is

$$\mathcal{I}(\delta) = -E\left[\frac{\partial^2 \ln l}{\partial\delta\,\partial\delta'}\right] \tag{3.4.12}$$

where $\delta' := (\mu', \alpha', \sigma')$ with $\sigma := \operatorname{vech}(\Sigma_u)$. Note that vech is a column stacking operator that stacks only the elements on and below the main diagonal of $\Sigma_u$. It is related to the vec operator by the $(\frac{1}{2}K(K+1) \times K^2)$ elimination matrix $L_K$, that is, $\operatorname{vech}(\Sigma_u) = L_K\operatorname{vec}(\Sigma_u)$ or, defining $\omega := \operatorname{vec}(\Sigma_u)$, $\sigma = L_K\omega$ (see Appendix A.12). For instance, for $K = 3$,

$$\omega = \operatorname{vec}(\Sigma_u) = \operatorname{vec}\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix} = (\sigma_{11}, \sigma_{12}, \sigma_{13}, \sigma_{12}, \sigma_{22}, \sigma_{23}, \sigma_{13}, \sigma_{23}, \sigma_{33})'$$

and

$$\sigma = \operatorname{vech}(\Sigma_u) = L_3\,\omega = \begin{bmatrix} \sigma_{11} \\ \sigma_{12} \\ \sigma_{13} \\ \sigma_{22} \\ \sigma_{23} \\ \sigma_{33} \end{bmatrix}. \tag{3.4.13}$$

Note that in $\delta$ we collect only the potentially different elements of $\Sigma_u$.
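The relation between vec and vech can be illustrated with a small sketch that builds the elimination matrix explicitly. The helper names `vec`, `vech` and `elimination_matrix` are chosen here for illustration; the construction simply places a one in each row at the position of the corresponding lower-triangular element in the column-stacked vector.

```python
import numpy as np

def vec(S):
    """Stack the columns of S."""
    return S.flatten(order="F")

def vech(S):
    """Stack the elements on and below the main diagonal, column by column."""
    K = S.shape[0]
    return np.array([S[i, j] for j in range(K) for i in range(j, K)])

def elimination_matrix(K):
    """(K(K+1)/2 x K^2) matrix L_K with vech(S) = L_K vec(S)."""
    rows = [(i, j) for j in range(K) for i in range(j, K)]
    L = np.zeros((len(rows), K * K))
    for r, (i, j) in enumerate(rows):
        L[r, j * K + i] = 1.0                  # position of S[i, j] in vec(S)
    return L

S = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 6.0]])
L3 = elimination_matrix(3)
sigma = L3 @ vec(S)                            # equals vech(S)
```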

The asymptotic covariance matrix of the ML estimator $\tilde{\delta}$ is known to be

$$\lim_{T \to \infty}\left[\mathcal{I}(\delta)/T\right]^{-1}. \tag{3.4.14}$$

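In order to determine this matrix, we need the second order partial derivatives of the log-likelihood. From (3.4.6) to (3.4.8) we get

$$\frac{\partial^2 \ln l}{\partial\mu\,\partial\mu'} = -T\left(I_K - \sum_{i}A_i\right)'\Sigma_u^{-1}\left(I_K - \sum_{i}A_i\right), \tag{3.4.15}$$

$$\frac{\partial^2 \ln l}{\partial\alpha\,\partial\alpha'} = -(XX' \otimes \Sigma_u^{-1}), \tag{3.4.16}$$

$$\frac{\partial^2 \ln l}{\partial\omega\,\partial\omega'} = \frac{T}{2}(\Sigma_u^{-1} \otimes \Sigma_u^{-1}) - \frac{1}{2}(\Sigma_u^{-1} \otimes \Sigma_u^{-1}UU'\Sigma_u^{-1}) - \frac{1}{2}(\Sigma_u^{-1}UU'\Sigma_u^{-1} \otimes \Sigma_u^{-1}), \tag{3.4.17}$$

where $\omega = \operatorname{vec}(\Sigma_u)$ (see Problem 3.3),

$$\frac{\partial^2 \ln l}{\partial\mu\,\partial\alpha'} = -\left[\sum_{t}\left(y_t - \mu - AY_{t-1}^0\right)'\Sigma_u^{-1} \otimes (j' \otimes I_K)\right]\frac{\partial\operatorname{vec}(A')}{\partial\alpha'} - \sum_{t}\left(Y_{t-1}^{0\prime} \otimes \left[I_K - A(j \otimes I_K)\right]'\Sigma_u^{-1}\right) \tag{3.4.18}$$

(see Problem 3.4),

$$\frac{\partial^2 \ln l}{\partial\omega\,\partial\mu'} = \frac{1}{2}(\Sigma_u^{-1} \otimes \Sigma_u^{-1})\left[(I_K \otimes U)\frac{\partial\operatorname{vec}(U')}{\partial\mu'} + (U \otimes I_K)\frac{\partial\operatorname{vec}(U)}{\partial\mu'}\right] \tag{3.4.19}$$

(see Problem 3.5), and

$$\frac{\partial^2 \ln l}{\partial\omega\,\partial\alpha'} = -\frac{1}{2}(\Sigma_u^{-1} \otimes \Sigma_u^{-1})\left[(I_K \otimes UX')\frac{\partial\operatorname{vec}(A')}{\partial\alpha'} + (UX' \otimes I_K)\right] \tag{3.4.20}$$

(see Problem 3.6).

It is obvious from (3.4.18) that

$$\lim T^{-1}E\left(\frac{\partial^2 \ln l}{\partial\mu\,\partial\alpha'}\right) = 0 \tag{3.4.21}$$

because $E(\sum_t Y_{t-1}^0/T) \to 0$. Furthermore, from (3.4.19), it follows that

$$E\left(\frac{\partial^2 \ln l}{\partial\omega\,\partial\mu'}\right) = 0 \tag{3.4.22}$$

because $E(U) = 0$ and $\partial\operatorname{vec}(U')/\partial\mu'$ is a matrix of constants. Moreover, from (3.4.20), we have

$$\lim T^{-1}E\left(\frac{\partial^2 \ln l}{\partial\omega\,\partial\alpha'}\right) = 0 \tag{3.4.23}$$

because $E(UX'/T) \to 0$. Thus, $\lim \mathcal{I}(\delta)/T$ is block diagonal and we get the asymptotic distributions of $\tilde{\mu}$, $\tilde{\alpha}$, and $\tilde{\sigma}$ as follows.

Multiplying minus the inverse of (3.4.15) by $T$ gives the asymptotic covariance matrix of the ML estimator for the mean vector $\mu$, that is,

$$\Sigma_{\tilde{\mu}} = \left(I_K - \sum_{i}A_i\right)^{-1}\Sigma_u\left(I_K - \sum_{i}A_i\right)'^{-1}. \tag{3.4.24}$$

Hence, $\tilde{\mu}$ has the same asymptotic distribution as $\bar{y}$ (see Proposition 3.3). In other words, the two estimators for $\mu$ are asymptotically equivalent and, under the present conditions, this fact implies that $\bar{y}$ is asymptotically efficient because the ML estimator is asymptotically efficient. The asymptotic equivalence of $\tilde{\mu}$ and $\bar{y}$ can also be seen from (3.4.9) (see the argument prior to Proposition 3.3 and Problem 3.7).

Taking the limit of $T^{-1}$ times the expectation of minus (3.4.16) gives $\Gamma_Y(0) \otimes \Sigma_u^{-1}$. Note that $E(XX'/T)$ is not strictly equal to $\Gamma_Y(0)$ because we have assumed fixed initial values $y_{-p+1}, \ldots, y_0$. However, asymptotically, as $T$ goes to infinity, the impact of the initial values vanishes. Thus, we get

$$\sqrt{T}(\tilde{\alpha} - \alpha) \xrightarrow{d} \mathcal{N}(0, \Gamma_Y(0)^{-1} \otimes \Sigma_u). \tag{3.4.25}$$

Of course, this result also follows from the equivalence of the ML and LS estimators.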

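Noting that $E(UU') = T\Sigma_u$, it follows from (3.4.17) that

$$E\left(\frac{\partial^2 \ln l}{\partial\omega\,\partial\omega'}\right) = -\frac{T}{2}(\Sigma_u^{-1} \otimes \Sigma_u^{-1}). \tag{3.4.26}$$

Denoting by $D_K$ the $(K^2 \times \frac{1}{2}K(K+1))$ duplication matrix (see Appendix A.12) so that $\omega = D_K\sigma$, we get

$$\frac{\partial^2 \ln l}{\partial\sigma\,\partial\sigma'} = \frac{\partial\omega'}{\partial\sigma}\,\frac{\partial^2 \ln l}{\partial\omega\,\partial\omega'}\,\frac{\partial\omega}{\partial\sigma'} = D_K'\,\frac{\partial^2 \ln l}{\partial\omega\,\partial\omega'}\,D_K$$

and, hence,

$$\sqrt{T}(\tilde{\sigma} - \sigma) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\tilde{\sigma}}) \tag{3.4.27}$$

with

$$\Sigma_{\tilde{\sigma}} = -T\,E\left(\frac{\partial^2 \ln l}{\partial\sigma\,\partial\sigma'}\right)^{-1} = 2\left[D_K'(\Sigma_u^{-1} \otimes \Sigma_u^{-1})D_K\right]^{-1} = 2D_K^+(\Sigma_u \otimes \Sigma_u)D_K^{+\prime}, \tag{3.4.28}$$

where $D_K^+ = (D_K'D_K)^{-1}D_K'$ is the Moore-Penrose generalized inverse of the duplication matrix and Rule (17) from Appendix A.12 has been used. In summary, we get the following proposition.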

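Proposition 3.4 (Asymptotic Properties of ML Estimators)

Let $y_t$ be a stationary, stable Gaussian VAR(p) process as in (3.3.1). Then the ML estimators $\tilde{\mu}$, $\tilde{\alpha}$, and $\tilde{\sigma} = \operatorname{vech}(\tilde{\Sigma}_u)$ given in (3.4.9)-(3.4.11) are consistent and

$$\sqrt{T}\begin{bmatrix} \tilde{\mu} - \mu \\ \tilde{\alpha} - \alpha \\ \tilde{\sigma} - \sigma \end{bmatrix} \xrightarrow{d} \mathcal{N}\left(0, \begin{bmatrix} \Sigma_{\tilde{\mu}} & 0 & 0 \\ 0 & \Sigma_{\tilde{\alpha}} & 0 \\ 0 & 0 & \Sigma_{\tilde{\sigma}} \end{bmatrix}\right), \tag{3.4.29}$$

so that $\tilde{\mu}$ is asymptotically independent of $\tilde{\alpha}$ and $\tilde{\Sigma}_u$ and $\tilde{\alpha}$ is asymptotically independent of $\tilde{\mu}$ and $\tilde{\Sigma}_u$. The covariance matrices are

$$\begin{aligned}
\Sigma_{\tilde{\mu}} &= \left(I_K - \sum_{i}A_i\right)^{-1}\Sigma_u\left(I_K - \sum_{i}A_i\right)'^{-1}, \\
\Sigma_{\tilde{\alpha}} &= \Gamma_Y(0)^{-1} \otimes \Sigma_u, \\
\Sigma_{\tilde{\sigma}} &= 2D_K^+(\Sigma_u \otimes \Sigma_u)D_K^{+\prime}.
\end{aligned}$$

They may be estimated consistently by replacing the unknown quantities by their ML estimators and estimating $\Gamma_Y(0)$ by $\tilde{X}\tilde{X}'/T$. ■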
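The covariance matrix of the asymptotic distribution of the vech of the white noise covariance estimator can be evaluated directly once the duplication matrix is available. The sketch below builds the duplication matrix for a given dimension and applies the formula to the diagonal covariance matrix of the earlier bivariate example, rescaled by $10^4$ for readability. The function names are hypothetical, and `numpy.linalg.pinv` is used for the Moore-Penrose inverse.

```python
import numpy as np

def duplication_matrix(K):
    """(K^2 x K(K+1)/2) matrix D_K with vec(S) = D_K vech(S) for symmetric S."""
    pairs = [(i, j) for j in range(K) for i in range(j, K)]   # vech ordering
    index = {pair: r for r, pair in enumerate(pairs)}
    D = np.zeros((K * K, len(pairs)))
    for j in range(K):
        for i in range(K):
            # vec position j*K + i holds S[i, j] = S[max(i, j), min(i, j)]
            D[j * K + i, index[(max(i, j), min(i, j))]] = 1.0
    return D

def sigma_vech_cov(Sigma_u):
    """Asymptotic covariance 2 D^+ (Sigma_u kron Sigma_u) D^+' of sqrt(T) vech(Sigma_tilde)."""
    K = Sigma_u.shape[0]
    D_plus = np.linalg.pinv(duplication_matrix(K))
    return 2.0 * D_plus @ np.kron(Sigma_u, Sigma_u) @ D_plus.T

Sigma_u = np.array([[9.0, 0.0], [0.0, 4.0]])   # example covariance, times 10^4
cov_sigma = sigma_vech_cov(Sigma_u)
```

For a diagonal covariance matrix the result is diagonal, with twice the squared variances in the positions of the variance elements and the product of the two variances in the position of the covariance element.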


In this section, we have chosen to consider the mean-adjusted form of a VAR(p) process. Of course, it is possible to perform a similar derivation for the standard form given in (3.1.1). In that case the ML estimators of $\nu$ and $\alpha$ are not asymptotically independent though. Their joint asymptotic distribution is identical to that of $\hat{\beta}$ given in Proposition 3.1. From Proposition 3.2 we know that the asymptotic distribution of $\tilde{\sigma}$ remains unaltered. In the next section, we will investigate the consequences of forecasting with estimated rather than known processes.

3.5 Forecasting with Estimated Processes

3.5.1  General  Assumptions  and  Results 

In  Chapter  2,  Section  2.2,  we  have  seen  that  the  optimal  h- step  forecast  of 
the  process  (3.1.1)  is 

yt(h)  =  v  +  Aryt(h  -  1)  H - b  Apyt(h  -  p ),  (3.5.1) 

where  yt(j)  =  Vt+j  for  j  <  0.  If  the  true  coefficients  B  =  (v,  A1: . . .  ,Ap)  are 

replaced  by  estimators  B  =  (V,Ai,..,,  Ap),  we  get  a  forecast 

yt(h)  =  v  +  Apyt(h  -  1)  H - b  Apyt(h  -  p ),  (3.5.2) 

where  yt(j)  =  yt+j  for  j  <  0.  Thus,  the  forecast  error  is 
yt+h-yt{h)  =  [yt+h  -  yt(h)}  +  [yt{h)  -  yt{h)} 

h- 1 

=  ^2  &iut+h-i  +  [ yt(h )  -  yt(h)\ ,  (3.5.3) 

i= 0 

where  the  are  the  coefficient  matrices  of  the  canonical  MA  representation  of 
yt  (see  (2.2.9)).  Under  quite  general  conditions  for  the  process  yt,  the  forecast 
errors  can  be  shown  to  have  zero  mean,  E  [yt+h  —  yt(h)}  =  0,  so  that  the 
forecasts  are  unbiased  even  if  the  coefficients  are  estimated.  Because  we  do 
not  need  this  result  in  the  following,  we  refer  to  Dufour  (1985)  for  the  details 
and  a  proof.  All  the  us  in  the  first  term  on  the  right-hand  side  of  the  last 
equality  sign  in  (3.5.3)  are  attached  to  periods  s  >  t,  whereas  all  the  ys 
in  the  second  term  correspond  to  periods  s  <  t,  if  estimation  is  done  with 
observations  from  periods  up  to  time  t  only.  Therefore,  the  two  terms  are 
uncorrelated.  Hence,  the  MSE  matrix  of  the  forecast  yt.(h)  is  of  the  form 

Zy(h)  :=  MSE  [yt(h)}  =  E{[yt+h  -  yt{h)}[yt+h  -  yt(h)}'} 

=  Zy(h)  +  MSE  [yt(h)  -  yt(h)\ ,  (3.5.4) 

where 


h- 1 

2=0 

(see  (2.2.11)).  In  order  to  evaluate  the  last  term  in  (3.5.4),  the  distribution  of 
the  estimator  B  is  needed.  Because  we  have  not  been  able  to  derive  the  small 
sample  distributions  of  the  estimators  considered  in  the  previous  sections  but 
we  have  derived  the  asymptotic  distributions  instead,  we  cannot  hope  for 
more  than  an  asymptotic  approximation  to  the  MSE  of  yt(h)  —  yt(h).  Such 
an  approximation  will  be  derived  in  the  following. 

There are two alternative assumptions that can be made in order to facilitate the derivation of the desired result:

(1) Only data up to the forecast origin are used for estimation.

(2) Estimation is done using a realization (time series) of a process that is independent of the process used for prediction and has the same stochastic structure (for instance, it is Gaussian and has the same first and second moments as the process used for prediction).

The first assumption is the more realistic one from a practical point of view because estimation and forecasting are usually based on the same data set. In that case, because the sample size is assumed to go to infinity in deriving asymptotic results, either the forecast origin has to go to infinity too or it has to be assumed that more and more data at the beginning of the sample become available. Because the forecast uses only $p$ vectors $y_s$ prior to the forecast period, these variables will be asymptotically independent of the estimator $\hat{B}$ (they are asymptotically negligible in comparison with all the other observations going into the estimate). Thus, asymptotically the first assumption implies the same results as the second one. In the following, for simplicity, the second assumption will therefore be used. Furthermore, it will be assumed that for $\beta = \mathrm{vec}(B)$ and $\hat{\beta} = \mathrm{vec}(\hat{B})$ we have

$$\sqrt{T}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\beta}}). \tag{3.5.5}$$

Samaranayake & Hasza (1988) and Basu & Sen Roy (1986) give a formal proof of the result that the MSE approximation obtained in the following remains valid under assumption (1) above.

With the foregoing assumptions it follows that, conditional on a particular realization $Y_t = (y_t', \ldots, y_{t-p+1}')'$ of the process used for prediction,

$$\sqrt{T}\,[\hat{y}_t(h) - y_t(h) \mid Y_t] \xrightarrow{d} \mathcal{N}\left(0, \frac{\partial y_t(h)}{\partial\beta'}\,\Sigma_{\hat{\beta}}\,\frac{\partial y_t(h)'}{\partial\beta}\right) \tag{3.5.6}$$

because $\hat{y}_t(h)$ is a differentiable function of $\hat{\beta}$ (see Appendix C, Proposition C.15(3)). Here $T$ is the sample size (time series length) used for estimation. This result suggests the approximation of $\mathrm{MSE}[\hat{y}_t(h) - y_t(h)]$ by $\Omega(h)/T$, where

$$\Omega(h) = E\left[\frac{\partial y_t(h)}{\partial\beta'}\,\Sigma_{\hat{\beta}}\,\frac{\partial y_t(h)'}{\partial\beta}\right]. \tag{3.5.7}$$

In fact, for a Gaussian process $y_t$,

$$\sqrt{T}\,[\hat{y}_t(h) - y_t(h)] \xrightarrow{d} \mathcal{N}(0, \Omega(h)). \tag{3.5.8}$$

Hence, we get an approximation

$$\Sigma_{\hat{y}}(h) = \Sigma_y(h) + \frac{1}{T}\,\Omega(h) \tag{3.5.9}$$

for the MSE matrix of $\hat{y}_t(h)$.

From (3.5.7) it is obvious that $\Omega(h)$ and, thus, the approximate MSE $\Sigma_{\hat{y}}(h)$ can be reduced by using an estimator that is asymptotically more efficient than $\hat{\beta}$, if such an estimator exists. In other words, efficient estimation is of importance in order to reduce the forecast uncertainty.


3.5.2  The  Approximate  MSE  Matrix 

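To derive an explicit expression for $\Omega(h)$, the derivatives $\partial y_t(h)/\partial\beta'$ are needed. They can be obtained easily by noting that

$$y_t(h) = J_1 \mathbf{B}^h Z_t, \tag{3.5.10}$$

where $Z_t := (1, y_t', \ldots, y_{t-p+1}')'$,

$$\mathbf{B} := \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \nu & A_1 & A_2 & \cdots & A_{p-1} & A_p \\ 0 & I_K & 0 & \cdots & 0 & 0 \\ 0 & 0 & I_K & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & I_K & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ & B & \\ 0 & I_{K(p-1)} & 0 \end{bmatrix} \quad [(Kp+1) \times (Kp+1)]$$

and

$$J_1 := [\,\underset{(K \times 1)}{0} : I_K : \underset{(K \times K(p-1))}{0}\,] \quad [K \times (Kp+1)].$$

The relation (3.5.10) follows by induction (see Problem 3.8). Using (3.5.10), we get

$$\begin{aligned}
\frac{\partial y_t(h)}{\partial\beta'} &= \frac{\partial\,\mathrm{vec}(J_1 \mathbf{B}^h Z_t)}{\partial\beta'} = (Z_t' \otimes J_1)\,\frac{\partial\,\mathrm{vec}(\mathbf{B}^h)}{\partial\beta'} \\
&= (Z_t' \otimes J_1)\left[\sum_{i=0}^{h-1} (\mathbf{B}')^{h-1-i} \otimes \mathbf{B}^i\right]\frac{\partial\,\mathrm{vec}(\mathbf{B})}{\partial\beta'} \quad \text{(Appendix A.13, Rule (8))} \\
&= (Z_t' \otimes J_1)\left[\sum_{i=0}^{h-1} (\mathbf{B}')^{h-1-i} \otimes \mathbf{B}^i\right](I_{Kp+1} \otimes J_1') \quad \text{(see the definition of } \mathbf{B}\text{)} \\
&= \sum_{i=0}^{h-1} Z_t'(\mathbf{B}')^{h-1-i} \otimes J_1 \mathbf{B}^i J_1' \\
&= \sum_{i=0}^{h-1} Z_t'(\mathbf{B}')^{h-1-i} \otimes \Phi_i,
\end{aligned} \tag{3.5.11}$$

where $\Phi_i = J_1 \mathbf{B}^i J_1'$ follows as in (2.1.17). Using the LS estimator $\hat{\beta}$ with asymptotic covariance matrix $\Sigma_{\hat{\beta}} = \Gamma^{-1} \otimes \Sigma_u$ (see Proposition 3.1), the matrix $\Omega(h)$ is seen to be

$$\begin{aligned}
\Omega(h) &= E\left[\frac{\partial y_t(h)}{\partial\beta'}\,(\Gamma^{-1} \otimes \Sigma_u)\,\frac{\partial y_t(h)'}{\partial\beta}\right] \\
&= \sum_{i=0}^{h-1}\sum_{j=0}^{h-1} E\left(Z_t'(\mathbf{B}')^{h-1-i}\,\Gamma^{-1}\,\mathbf{B}^{h-1-j} Z_t\right) \otimes \Phi_i \Sigma_u \Phi_j' \\
&= \sum_{i=0}^{h-1}\sum_{j=0}^{h-1} E\left[\mathrm{tr}\left(Z_t'(\mathbf{B}')^{h-1-i}\,\Gamma^{-1}\,\mathbf{B}^{h-1-j} Z_t\right)\right] \Phi_i \Sigma_u \Phi_j' \\
&= \sum_{i=0}^{h-1}\sum_{j=0}^{h-1} \mathrm{tr}\left[(\mathbf{B}')^{h-1-i}\,\Gamma^{-1}\,\mathbf{B}^{h-1-j}\,E(Z_t Z_t')\right] \Phi_i \Sigma_u \Phi_j' \\
&= \sum_{i=0}^{h-1}\sum_{j=0}^{h-1} \mathrm{tr}\left[(\mathbf{B}')^{h-1-i}\,\Gamma^{-1}\,\mathbf{B}^{h-1-j}\,\Gamma\right] \Phi_i \Sigma_u \Phi_j',
\end{aligned} \tag{3.5.12}$$

provided $y_t$ is stable so that

$$\Gamma := \mathrm{plim}(ZZ'/T) = E(Z_t Z_t').$$

Here $Z := (Z_0, \ldots, Z_{T-1})$ is the $((Kp+1) \times T)$ matrix defined in (3.2.1). For example, for $h = 1$,

$$\Omega(1) = (Kp+1)\,\Sigma_u.$$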

Hence, the approximation

$$\Sigma_{\hat{y}}(1) = \Sigma_u + \frac{Kp+1}{T}\,\Sigma_u = \frac{T+Kp+1}{T}\,\Sigma_u \tag{3.5.13}$$

of the MSE matrix of the 1-step forecast with estimated coefficients is obtained. This expression shows that the contribution of the estimation variability to the forecast MSE matrix $\Sigma_{\hat{y}}(1)$ depends on the dimension $K$ of the process, the VAR order $p$, and the sample size $T$ used for estimation. It can be quite substantial if the sample size is small or moderate. For instance, considering a three-dimensional process of order 8 which is estimated from 15 years of quarterly data (i.e., $T = 52$ plus 8 presample values needed for LS estimation), the 1-step forecast MSE matrix $\Sigma_u$ for known processes is inflated by a factor $(T + Kp + 1)/T = 1.48$. Of course, this approximation is derived from asymptotic theory so that its small sample validity is not guaranteed. We will take a closer look at this problem shortly. Obviously, the inflation factor $(T + Kp + 1)/T \to 1$ for $T \to \infty$. Thus the MSE contribution due to sampling variability vanishes if the sample size gets large. This result is a consequence of estimating the VAR coefficients consistently. An expression for $\Omega(h)$ can also be derived on the basis of the mean-adjusted form of the VAR process (see Problem 3.9).

In practice, for $h > 1$, it will not be possible to evaluate $\Omega(h)$ without knowing the AR coefficients summarized in the matrix $\mathbf{B}$. A consistent estimator $\hat{\Omega}(h)$ may be obtained by replacing all unknown parameters by their LS estimators, that is, $\mathbf{B}$ is replaced by $\hat{\mathbf{B}}$ which is obtained by using $\hat{B}$ for $B$, $\Sigma_u$ is replaced by $\hat{\Sigma}_u$, $\Phi_i$ is estimated by $\hat{\Phi}_i = J_1 \hat{\mathbf{B}}^i J_1'$, and $\Gamma$ is estimated by $\hat{\Gamma} = ZZ'/T$. The resulting estimator of $\Sigma_{\hat{y}}(h)$ will be denoted by $\hat{\Sigma}_{\hat{y}}(h)$ in the following.
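The double sum in (3.5.12) is easy to evaluate numerically once $\mathbf{B}$, $\Gamma$, and $\Sigma_u$ (or their estimates) are available. The following Python sketch is one possible way to do so; the function name, the bivariate VAR(1) coefficients, and the matrix used for $\Gamma$ are made-up illustrations rather than values from the text. It also reproduces the inflation factor quoted above for $K = 3$, $p = 8$, $T = 52$.

```python
import numpy as np

def omega_h(B_big, Gamma, Sigma_u, K, h):
    """Evaluate (3.5.12): sum_{i,j} tr[(B')^{h-1-i} Gamma^{-1} B^{h-1-j} Gamma] Phi_i Sigma_u Phi_j'."""
    n = B_big.shape[0]                                   # n = Kp + 1
    J1 = np.hstack([np.zeros((K, 1)), np.eye(K), np.zeros((K, n - K - 1))])
    powers = [np.linalg.matrix_power(B_big, i) for i in range(h)]
    Phi = [J1 @ Bi @ J1.T for Bi in powers]              # Phi_i = J_1 B^i J_1'
    Gamma_inv = np.linalg.inv(Gamma)
    Omega = np.zeros((K, K))
    for i in range(h):
        for j in range(h):
            w = np.trace(powers[h - 1 - i].T @ Gamma_inv @ powers[h - 1 - j] @ Gamma)
            Omega += w * (Phi[i] @ Sigma_u @ Phi[j].T)
    return Omega

# Hypothetical bivariate VAR(1), so that the big matrix is (3 x 3).
K, p = 2, 1
nu = np.array([0.1, 0.2])
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
B_big = np.vstack([np.hstack([[1.0], np.zeros(K * p)]), np.hstack([nu[:, None], A1])])
Gamma = np.array([[1.0, 0.3, 0.5], [0.3, 2.0, 0.4], [0.5, 0.4, 1.5]])  # any positive definite matrix
Sigma_u = np.array([[1.0, 0.2], [0.2, 0.5]])

Omega1 = omega_h(B_big, Gamma, Sigma_u, K, h=1)   # should equal (Kp + 1) * Sigma_u
Omega2 = omega_h(B_big, Gamma, Sigma_u, K, h=2)

# Inflation factor (T + Kp + 1)/T for K = 3, p = 8, T = 52 as in the text.
inflation = (52 + 3 * 8 + 1) / 52
print(Omega1, Omega2, round(inflation, 2))
```

For $h = 1$ the trace reduces to $\mathrm{tr}(I_{Kp+1}) = Kp + 1$ regardless of $\Gamma$, which is a convenient sanity check for any implementation.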

The foregoing discussion is of importance in setting up interval forecasts. Assuming that $y_t$ is Gaussian, an approximate $(1 - \alpha)100\%$ interval forecast, $h$ periods ahead, for the $k$-th component $y_{k,t}$ of $y_t$ is

$$\hat{y}_{k,t}(h) \pm z_{(\alpha/2)}\,\hat{\sigma}_k(h) \tag{3.5.14}$$

or

$$\left[\hat{y}_{k,t}(h) - z_{(\alpha/2)}\,\hat{\sigma}_k(h),\ \hat{y}_{k,t}(h) + z_{(\alpha/2)}\,\hat{\sigma}_k(h)\right], \tag{3.5.15}$$

where $z_{(\alpha)}$ is the upper $\alpha 100$-th percentile of the standard normal distribution and $\hat{\sigma}_k(h)$ is the square root of the $k$-th diagonal element of $\hat{\Sigma}_{\hat{y}}(h)$. Using Bonferroni's inequality, approximate joint confidence regions for a set of forecasts can be obtained just as described in Section 2.2.3 of Chapter 2.

3.5.3  An  Example 

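To illustrate the previous results, we consider again the investment/income/consumption example of Section 3.2.3. Using the VAR(2) model with the coefficient estimates given in (3.2.22) and

$$y_{T-1} = y_{72} = \begin{bmatrix} .02551 \\ .02434 \\ .01319 \end{bmatrix} \quad \text{and} \quad y_T = y_{73} = \begin{bmatrix} .03637 \\ .00517 \\ .00599 \end{bmatrix}$$

results in forecasts

$$\hat{y}_T(1) = \hat{\nu} + \hat{A}_1 y_T + \hat{A}_2 y_{T-1} = \begin{bmatrix} -.011 \\ .020 \\ .022 \end{bmatrix},$$

$$\hat{y}_T(2) = \hat{\nu} + \hat{A}_1 \hat{y}_T(1) + \hat{A}_2 y_T = \begin{bmatrix} .011 \\ .020 \\ .015 \end{bmatrix},$$

and so on.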
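The forecast recursion (3.5.2) can be sketched in a few lines of Python. The function name is hypothetical, and the coefficient values below are rounded numbers that we assume for (3.2.22), which is not reproduced in this section. Because of that rounding, the results should agree with the forecasts above only up to the third decimal place.

```python
import numpy as np

def var_forecast(nu, A_list, y_hist, h_max):
    """Iterate y(h) = nu + A_1 y(h-1) + ... + A_p y(h-p), with y(j) = y_{T+j} for j <= 0."""
    p = len(A_list)
    path = [np.asarray(y, dtype=float) for y in y_hist[-p:]]   # y_{T-p+1}, ..., y_T
    for _ in range(h_max):
        nxt = np.array(nu, dtype=float)
        for i, A in enumerate(A_list, start=1):
            nxt = nxt + A @ path[-i]      # path[-i] is the forecast (or observation) i steps back
        path.append(nxt)
    return np.array(path[p:])             # rows: 1-step, 2-step, ... forecasts

# Assumed (rounded) coefficient estimates; treat them as illustrative inputs.
nu_hat = np.array([-0.017, 0.016, 0.013])
A1_hat = np.array([[-0.320, 0.146, 0.961],
                   [0.044, -0.153, 0.289],
                   [-0.002, 0.225, -0.264]])
A2_hat = np.array([[-0.161, 0.115, 0.934],
                   [0.050, 0.019, -0.010],
                   [0.034, 0.355, -0.022]])
y72 = np.array([0.02551, 0.02434, 0.01319])
y73 = np.array([0.03637, 0.00517, 0.00599])

forecasts = var_forecast(nu_hat, [A1_hat, A2_hat], [y72, y73], h_max=2)
print(np.round(forecasts, 3))
```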

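The estimated forecast MSE matrix for $h = 1$ is

$$\hat{\Sigma}_{\hat{y}}(1) = \frac{T + Kp + 1}{T}\,\hat{\Sigma}_u = \frac{73 + 6 + 1}{73}\,\hat{\Sigma}_u = \begin{bmatrix} 23.34 & .785 & 1.351 \\ .785 & 1.505 & .674 \\ 1.351 & .674 & .978 \end{bmatrix} \times 10^{-4},$$

where $\hat{\Sigma}_u$ from (3.2.23) has been used. We need $\hat{\Phi}_1$ for evaluating

$$\hat{\Sigma}_{\hat{y}}(2) = \hat{\Sigma}_y(2) + \frac{1}{T}\,\hat{\Omega}(2),$$

where

$$\hat{\Sigma}_y(2) = \hat{\Sigma}_u + \hat{\Phi}_1 \hat{\Sigma}_u \hat{\Phi}_1'$$

and

$$\begin{aligned}
\hat{\Omega}(2) &= \sum_{i=0}^{1}\sum_{j=0}^{1} \mathrm{tr}\left[(\hat{\mathbf{B}}')^{1-i}\,(ZZ'/T)^{-1}\,\hat{\mathbf{B}}^{1-j}\,ZZ'/T\right] \hat{\Phi}_i \hat{\Sigma}_u \hat{\Phi}_j' \\
&= \mathrm{tr}\left[\hat{\mathbf{B}}'(ZZ')^{-1}\hat{\mathbf{B}}\,ZZ'\right]\hat{\Sigma}_u + \mathrm{tr}(\hat{\mathbf{B}}')\,\hat{\Sigma}_u \hat{\Phi}_1' + \mathrm{tr}(\hat{\mathbf{B}})\,\hat{\Phi}_1 \hat{\Sigma}_u + \mathrm{tr}(I_{Kp+1})\,\hat{\Phi}_1 \hat{\Sigma}_u \hat{\Phi}_1'.
\end{aligned}$$

From (2.1.22) we know that $\Phi_1 = A_1$. Hence, we use $\hat{\Phi}_1 = \hat{A}_1$ from (3.2.22). Thus, we get

$$\hat{\Sigma}_y(2) = \begin{bmatrix} 23.67 & .547 & 1.226 \\ .547 & 1.488 & .554 \\ 1.226 & .554 & .952 \end{bmatrix} \times 10^{-4}$$

and

$$\hat{\Omega}(2) = \begin{bmatrix} 10.59 & .238 & .538 \\ .238 & .675 & .233 \\ .538 & .233 & .422 \end{bmatrix} \times 10^{-3}.$$

Consequently,

$$\hat{\Sigma}_{\hat{y}}(2) = \begin{bmatrix} 25.12 & .580 & 1.300 \\ .580 & 1.581 & .586 \\ 1.300 & .586 & 1.009 \end{bmatrix} \times 10^{-4}. \tag{3.5.18}$$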
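The last step, adding $\hat{\Omega}(2)/T$ to $\hat{\Sigma}_y(2)$, can be checked with a short Python sketch. The variable names are ours; the numbers are the ones reported above, so small rounding differences are to be expected.

```python
import numpy as np

T = 73  # sample size used for estimation in the example

# Matrices as reported in the text (note the different scale factors).
Sigma_y2 = np.array([[23.67, 0.547, 1.226],
                     [0.547, 1.488, 0.554],
                     [1.226, 0.554, 0.952]]) * 1e-4
Omega2 = np.array([[10.59, 0.238, 0.538],
                   [0.238, 0.675, 0.233],
                   [0.538, 0.233, 0.422]]) * 1e-3

# Approximate MSE matrix of the 2-step forecast with estimated coefficients, as in (3.5.9).
Sigma_yhat2 = Sigma_y2 + Omega2 / T
print(np.round(Sigma_yhat2 * 1e4, 3))
```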

Assuming that the data are generated by a Gaussian process, we get the following approximate 95% interval forecasts:

$$\begin{aligned}
\hat{y}_{1,T}(1) \pm 1.96\,\hat{\sigma}_1(1) &\quad\text{or}\quad -.011 \pm .095, \\
\hat{y}_{2,T}(1) \pm 1.96\,\hat{\sigma}_2(1) &\quad\text{or}\quad .020 \pm .024, \\
\hat{y}_{3,T}(1) \pm 1.96\,\hat{\sigma}_3(1) &\quad\text{or}\quad .022 \pm .019, \\
\hat{y}_{1,T}(2) \pm 1.96\,\hat{\sigma}_1(2) &\quad\text{or}\quad .011 \pm .098, \\
\hat{y}_{2,T}(2) \pm 1.96\,\hat{\sigma}_2(2) &\quad\text{or}\quad .020 \pm .025, \\
\hat{y}_{3,T}(2) \pm 1.96\,\hat{\sigma}_3(2) &\quad\text{or}\quad .015 \pm .020.
\end{aligned} \tag{3.5.19}$$

In Figure 3.3, some more forecasts of the three variables with two-standard error bounds to each side are depicted. The intervals indicated by the dashed bounds may be interpreted as approximate 95% forecast intervals for the individual forecasts. If the region enclosed by the dashed lines is viewed as a joint confidence region for all 4 forecasts, a lower bound for the (approximate) probability content is $(100 - 4 \times 5)\% = 80\%$. In the figure it can be seen that for investment and income the actually observed values for 1979 ($t = 77, \ldots, 80$) are well inside the forecast regions, whereas two of the four consumption values are outside that region.
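The half-widths in (3.5.19) follow directly from the diagonal elements of the estimated MSE matrices. The Python sketch below, with a hypothetical helper name, reproduces them from the matrices reported above and also evaluates the Bonferroni lower bound for four joint 95% intervals.

```python
import numpy as np
from statistics import NormalDist

def interval_half_widths(mse_matrix, level=0.95):
    """Return z_(alpha/2) times the square roots of the diagonal elements of an MSE matrix."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
    return z * np.sqrt(np.diag(mse_matrix))

# Estimated MSE matrices for h = 1 and h = 2 as reported in the text.
Sigma_yhat1 = np.array([[23.34, 0.785, 1.351],
                        [0.785, 1.505, 0.674],
                        [1.351, 0.674, 0.978]]) * 1e-4
Sigma_yhat2 = np.array([[25.12, 0.580, 1.300],
                        [0.580, 1.581, 0.586],
                        [1.300, 0.586, 1.009]]) * 1e-4

half1 = interval_half_widths(Sigma_yhat1)
half2 = interval_half_widths(Sigma_yhat2)

# Bonferroni lower bound for the joint coverage of 4 individual 95% intervals.
bonferroni_lower_bound = 1 - 4 * 0.05
print(np.round(half1, 3), np.round(half2, 3), bonferroni_lower_bound)
```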


3.5.4  A  Small  Sample  Investigation 

It is not obvious that the MSE and interval forecast approximations derived in the foregoing are reasonable in small samples because the MSE modification has been based on asymptotic theory. To investigate the small sample behavior of the predictor with estimated coefficients, we have used again 1000 realizations of the bivariate VAR(2) process (3.2.25)/(3.2.26) of Section 3.2.4 and we have computed forecast intervals for the period following the last sample period. In Table 3.2, the proportions of actual values falling in these intervals are reported for sample sizes of $T = 30$ and 100.


Table 3.2. Accuracy of forecast intervals in small samples based on 1000 bivariate time series

                                        percent of actual values falling
                                            in the forecast interval
MSE used in              % forecast        T = 30             T = 100
interval construction    interval        y_1     y_2        y_1     y_2
-------------------------------------------------------------------------
$\Sigma_y(1)$                90          86.5    85.7       89.7    89.4
                             95          92.6    91.8       94.5    94.0
                             99          98.1    98.0       99.0    98.5

$\Sigma_{\hat{y}}(1)$        90          89.3    88.2       90.4    90.0
                             95          94.4    94.1       95.3    94.6
                             99          99.0    98.4       99.3    98.8

$\hat{\Sigma}_y(1)$          90          85.2    84.2       89.6    88.5
                             95          90.5    90.4       94.7    93.9
                             99          98.4    96.5       98.9    98.3

$\hat{\Sigma}_{\hat{y}}(1)$  90          88.1    86.9       90.3    89.1
                             95          93.4    92.7       95.2    94.0
                             99          99.4    97.8       99.1    98.5

Fig. 3.3. Forecasts of the investment/income/consumption system. [Figure not reproduced: separate panels for investment, income, and consumption, each showing the forecast together with 95% confidence bounds.]


Obviously, for $T = 30$, the theoretical and actual percentages are in best agreement if the approximate MSEs $\Sigma_{\hat{y}}(h)$ are used in setting up the forecast intervals. On the other hand, only forecast intervals based on $\hat{\Sigma}_y(h) = \sum_{i=0}^{h-1} \hat{\Phi}_i \hat{\Sigma}_u \hat{\Phi}_i'$ and $\hat{\Sigma}_{\hat{y}}(h)$ are feasible in practice when the actual process coefficients are unknown and have to be estimated. Comparing only the results based on these two MSE matrices shows that it pays to use the asymptotic approximation $\hat{\Sigma}_{\hat{y}}(h)$.

In Table 3.2, we also give the corresponding results for $T = 100$. Because the estimation uncertainty decreases with increasing sample size, one would expect that now the theoretical and actual percentages are in good agreement for all MSEs. This is precisely what can be observed in the table. Nevertheless, even now the use of the MSE adjustment in $\hat{\Sigma}_{\hat{y}}(1)$ gives slightly more accurate interval forecasts.

3.6  Testing  for  Causality 

3.6.1  A  Wald  Test  for  Granger-Causality 

In Chapter 2, Section 2.3.1, we have partitioned the VAR($p$) process $y_t$ in subprocesses $z_t$ and $x_t$, that is, $y_t' = (z_t', x_t')$ and we have defined Granger-causality from $x_t$ to $z_t$ and vice versa. We have seen that this type of causality can be characterized by specific zero constraints on the VAR coefficients (see Corollary 2.2.1). Thus, in an estimated VAR($p$) system, if we want to test for Granger-causality, we need to test zero constraints for the coefficients. Given the results of Sections 3.2, 3.3, and 3.4 it is straightforward to derive asymptotic tests of such constraints.

More generally we consider testing

$$H_0: C\beta = c \quad \text{against} \quad H_1: C\beta \ne c, \tag{3.6.1}$$

where $C$ is an $(N \times (K^2p + K))$ matrix of rank $N$ and $c$ is an $(N \times 1)$ vector. Assuming that

$$\sqrt{T}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Gamma^{-1} \otimes \Sigma_u) \tag{3.6.2}$$

as in LS/ML estimation, we get

$$\sqrt{T}(C\hat{\beta} - C\beta) \xrightarrow{d} \mathcal{N}\left[0, C(\Gamma^{-1} \otimes \Sigma_u)C'\right] \tag{3.6.3}$$

(see Appendix C, Proposition C.15) and, hence,

$$T(C\hat{\beta} - c)'\left[C(\Gamma^{-1} \otimes \Sigma_u)C'\right]^{-1}(C\hat{\beta} - c) \xrightarrow{d} \chi^2(N). \tag{3.6.4}$$

This statistic is the Wald statistic (see Appendix C.7).

Replacing $\Gamma$ and $\Sigma_u$ by their usual estimators $\hat{\Gamma} = ZZ'/T$ and $\hat{\Sigma}_u$ as given in (3.2.19), the resulting statistic

$$\lambda_W = (C\hat{\beta} - c)'\left[C((ZZ')^{-1} \otimes \hat{\Sigma}_u)C'\right]^{-1}(C\hat{\beta} - c) \tag{3.6.5}$$

still has an asymptotic $\chi^2$-distribution with $N$ degrees of freedom, provided $y_t$ satisfies the conditions of Proposition 3.2, because under these conditions $[C((ZZ')^{-1} \otimes \hat{\Sigma}_u)C']^{-1}/T$ is a consistent estimator of $[C(\Gamma^{-1} \otimes \Sigma_u)C']^{-1}$. Hence, we have the following result.

Proposition 3.5 (Asymptotic Distribution of the Wald Statistic)

Suppose (3.6.2) holds. Furthermore, $\mathrm{plim}(ZZ'/T) = \Gamma$, $\mathrm{plim}\,\hat{\Sigma}_u = \Sigma_u$ are both nonsingular and $H_0: C\beta = c$ is true, with $C$ being an $(N \times (K^2p + K))$ matrix of rank $N$. Then

$$\lambda_W = (C\hat{\beta} - c)'\left[C((ZZ')^{-1} \otimes \hat{\Sigma}_u)C'\right]^{-1}(C\hat{\beta} - c) \xrightarrow{d} \chi^2(N).$$


In practice, it may be useful to make adjustments to the statistic or the critical values of the test to compensate for the fact that the matrix $\Gamma^{-1} \otimes \Sigma_u$ is unknown and has been replaced by an estimator. Working in that direction, we note that

$$N F(N, T) \xrightarrow{d} \chi^2(N) \quad \text{as } T \to \infty, \tag{3.6.6}$$

where $F(N, T)$ denotes an $F$ random variable with $N$ and $T$ degrees of freedom (d.f.) (Appendix C, Proposition C.3). Because an $F(N, T)$-distribution has a fatter tail than the $\chi^2(N)$-distribution divided by $N$, it seems reasonable to consider the test statistic

$$\lambda_F = \lambda_W / N \tag{3.6.7}$$

in conjunction with critical values from some $F$-distribution. The question is then what numbers of degrees of freedom should be used. From the foregoing discussion it is plausible to use $N$ as the numerator degrees of freedom. On the other hand, any sequence that goes to infinity with the sample size qualifies as a candidate for the denominator d.f. The usual $F$-statistic for a regression model with nonstochastic regressors has denominator d.f. equal to the sample size minus the number of estimated parameters. Therefore we may use this number here too. Note that, in the model (3.2.3), we have a vector $y$ with $KT$ observations and $\beta$ contains $K(Kp + 1)$ parameters. Alternatively, we will argue shortly that $T - Kp - 1$ is also a reasonable number for the denominator d.f. Hence, we have the approximate distributions

$$\lambda_F \approx F(N, KT - K^2p - K) \approx F(N, T - Kp - 1). \tag{3.6.8}$$
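A minimal Python sketch of the statistics in (3.6.5) and (3.6.7) is given below. The function names are ours, and the tiny univariate input at the end is made up purely to exercise the code; it is assumed that $\hat{\beta}$ is ordered as $\mathrm{vec}(\hat{\nu}, \hat{A}_1, \ldots, \hat{A}_p)$ and that $Z$ is the $((Kp+1) \times T)$ regressor matrix.

```python
import numpy as np

def wald_statistics(beta_hat, C, c, Z, Sigma_u_hat):
    """Return (lambda_W, lambda_F) for H0: C beta = c, following (3.6.5) and (3.6.7)."""
    cov = np.kron(np.linalg.inv(Z @ Z.T), Sigma_u_hat)     # (ZZ')^{-1} kron Sigma_u_hat
    diff = C @ beta_hat - c
    lam_w = float(diff @ np.linalg.solve(C @ cov @ C.T, diff))
    return lam_w, lam_w / C.shape[0]                        # N = number of restrictions

def denominator_dfs(K, p, T):
    """The two candidate denominator degrees of freedom from (3.6.8)."""
    return K * T - K**2 * p - K, T - K * p - 1

# Toy check with K = 1, p = 1, i.e. beta = (nu, a)'; all numbers are made up.
Z_toy = np.eye(2)
lam_w, lam_f = wald_statistics(np.array([0.0, 2.0]), np.array([[0.0, 1.0]]),
                               np.zeros(1), Z_toy, np.eye(1))
dfs = denominator_dfs(K=3, p=2, T=73)
print(lam_w, lam_f, dfs)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience only; it does not change the statistic.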

3.6.2  An  Example 

To see how this result can be used in a test for Granger-causality, let us consider again our example system from Section 3.2.3. The null hypothesis of no Granger-causality from income/consumption $(y_2, y_3)$ to investment $(y_1)$ may be expressed in terms of the coefficients of the VAR(2) process as

$$H_0: \alpha_{12,1} = \alpha_{13,1} = \alpha_{12,2} = \alpha_{13,2} = 0. \tag{3.6.9}$$

This null hypothesis may be written as in (3.6.1) by defining the $(4 \times 1)$ vector $c = 0$ and the $(4 \times 21)$ matrix

$$C = \left[\begin{array}{ccc|ccc|ccc|ccc|ccc|ccc|ccc}
0&0&0 & 0&0&0 & 1&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 \\
0&0&0 & 0&0&0 & 0&0&0 & 1&0&0 & 0&0&0 & 0&0&0 & 0&0&0 \\
0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 1&0&0 & 0&0&0 \\
0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 0&0&0 & 1&0&0
\end{array}\right].$$

With this notation, using the estimation results from Section 3.2.3,

$$\lambda_F = \hat{\beta}'C'\left[C((ZZ')^{-1} \otimes \hat{\Sigma}_u)C'\right]^{-1}C\hat{\beta}/4 = 1.59. \tag{3.6.10}$$

In contrast, the 95th percentile of the $F(4, 3 \cdot 73 - 9 \cdot 2 - 3) = F(4, 198) \approx F(4, 73 - 3 \cdot 2 - 1) = F(4, 66)$-distribution is about 2.5. Thus, in a 5% level test, we cannot reject Granger-noncausality from income/consumption to investment.
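The positions of the ones in $C$ follow from the ordering of $\beta = \mathrm{vec}(\nu, A_1, A_2)$, which stacks the columns of the $(3 \times 7)$ coefficient matrix. The following Python sketch, with a hypothetical helper `beta_index`, builds the restriction matrix from the indices of the restricted coefficients and converts the reported $\lambda_F$ into the corresponding Wald statistic.

```python
import numpy as np

K, p = 3, 2
n_coef = K * (K * p + 1)          # length of beta = vec(nu, A_1, A_2), here 21

def beta_index(row, col, lag):
    """0-based position in beta of the (row, col) element of A_lag (arguments are 1-based)."""
    block_col = 1 + (lag - 1) * K + (col - 1)   # column of B = (nu, A_1, ..., A_p); column 0 is nu
    return block_col * K + (row - 1)            # vec stacks columns, each of length K

# Restricted coefficients: alpha_{12,1}, alpha_{13,1}, alpha_{12,2}, alpha_{13,2}.
restricted = [(1, 2, 1), (1, 3, 1), (1, 2, 2), (1, 3, 2)]
C = np.zeros((len(restricted), n_coef))
for r, (row, col, lag) in enumerate(restricted):
    C[r, beta_index(row, col, lag)] = 1.0

lambda_F = 1.59                            # value reported in (3.6.10)
lambda_W = len(restricted) * lambda_F      # Wald statistic implied by (3.6.7)
print(C.astype(int))
print(lambda_W)
```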

In this example, the denominator d.f. are so large (namely 198 or 66) that we could just as well use $\lambda_W$ in conjunction with a critical value from a $\chi^2(4)$-distribution. The 95th percentile of that distribution is 9.49 and, thus, it is about four times that of the $F$-test while $\lambda_W = 4\lambda_F$.

In an example of this type it is quite reasonable to use $T - Kp - 1$ denominator d.f. for the $F$-test because all the restrictions are imposed on coefficients from one equation. Therefore $\lambda_F$ actually reduces to an $F$-statistic related to one equation with $Kp + 1$ parameters which are estimated from $T$ observations. The use of $T - Kp - 1$ d.f. may also be justified by arguments that do not rely on the restrictions being imposed on the parameters of one equation only, namely by appealing to the similarity between the $\lambda_F$ statistic and Hotelling's $T^2$ (e.g., Anderson (1984)).

Many other tests for Granger-causality have been proposed and investigated (see, e.g., Geweke, Meese & Dent (1983)). In the next chapter, we will return to the testing of hypotheses and then an alternative test will be considered.


3.6.3  Testing  for  Instantaneous  Causality 

Tests for instantaneous causality can be developed in the same way as tests for Granger-causality because instantaneous causality can be expressed in terms of zero restrictions for $\sigma = \mathrm{vech}(\Sigma_u)$ (see Proposition 2.3). If $y_t$ is a stable Gaussian VAR($p$) process and we wish to test

$$H_0: C\sigma = 0 \quad \text{against} \quad H_1: C\sigma \ne 0, \tag{3.6.11}$$

we may use the asymptotic distribution of the ML estimator given in Proposition 3.4 to set up the Wald statistic

$$\lambda_W = T\,\tilde{\sigma}'C'\left[2\,C D_K^{+}(\tilde{\Sigma}_u \otimes \tilde{\Sigma}_u)D_K^{+\prime}C'\right]^{-1}C\tilde{\sigma}, \tag{3.6.12}$$

where $D_K^{+}$ is the Moore-Penrose inverse of the duplication matrix $D_K$ and $C$ is an $(N \times K(K+1)/2)$ matrix of rank $N$. Under $H_0$, $\lambda_W$ has an asymptotic $\chi^2$-distribution with $N$ degrees of freedom.

Alternatively, a Wald test of (3.6.11) could be based on the lower triangular matrix $P$ which is obtained from a Choleski decomposition of $\Sigma_u$. Noting that instantaneous noncausality implies zero elements of $\Sigma_u$ that correspond to zero elements of $P$, we can write $H_0$ from (3.6.11) equivalently as

$$H_0: C\,\mathrm{vech}(P) = 0. \tag{3.6.13}$$

Because $\mathrm{vech}(P)$ is a continuously differentiable function of $\sigma$, the asymptotic distribution of the estimator $\tilde{P}$ obtained from decomposing $\tilde{\Sigma}_u$ follows from Proposition C.15(3) of Appendix C:

$$\sqrt{T}\,\mathrm{vech}(\tilde{P} - P) \xrightarrow{d} \mathcal{N}(0, \bar{H}\Sigma_{\tilde{\sigma}}\bar{H}'), \tag{3.6.14}$$

where

$$\bar{H} = \frac{\partial\,\mathrm{vech}(P)}{\partial\sigma'} = \left[L_K(I_{K^2} + K_{KK})(P \otimes I_K)L_K'\right]^{-1}$$

(see Appendix A.13, Rule (10)). Here $K_{mn}$ is the commutation matrix defined such that $\mathrm{vec}(G) = K_{mn}\mathrm{vec}(G')$ for any $(m \times n)$ matrix $G$ and $L_K$ is the $(\tfrac{1}{2}K(K+1) \times K^2)$ elimination matrix defined such that $\mathrm{vech}(F) = L_K\mathrm{vec}(F)$ for any $(K \times K)$ matrix $F$ (see Appendix A.12.2). A Wald test of (3.6.13) may therefore be based on

$$\lambda_W = T\,\mathrm{vech}(\tilde{P})'C'\left[C\hat{\bar{H}}\hat{\Sigma}_{\tilde{\sigma}}\hat{\bar{H}}'C'\right]^{-1}C\,\mathrm{vech}(\tilde{P}) \xrightarrow{d} \chi^2(N), \tag{3.6.15}$$

where hats denote the usual estimators. Although the two tests based on $\tilde{\sigma}$ and $\tilde{P}$ are derived from the same asymptotic distribution, they may differ in small samples. Of course, in the previous discussion we may replace $\tilde{\Sigma}_u$ by the asymptotically equivalent estimator $\hat{\Sigma}_u$.
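The statistic (3.6.12) can be sketched in Python as follows. The helper names and the bivariate covariance matrix are illustrative assumptions. In the toy case a single restriction $\sigma_{21} = 0$ is tested, for which the statistic reduces to $T\sigma_{21}^2/(\sigma_{11}\sigma_{22} + \sigma_{21}^2)$.

```python
import numpy as np

def vech(S):
    """Stack the elements on and below the main diagonal of S column by column."""
    K = S.shape[0]
    return np.concatenate([S[j:, j] for j in range(K)])

def duplication_matrix(K):
    """Return D_K with vec(S) = D_K @ vech(S) for a symmetric (K x K) matrix S."""
    D = np.zeros((K * K, K * (K + 1) // 2))
    col = 0
    for j in range(K):
        for i in range(j, K):
            D[j * K + i, col] = 1.0
            D[i * K + j, col] = 1.0
            col += 1
    return D

def instantaneous_wald(Sigma_u_tilde, C, T):
    """Wald statistic (3.6.12) for H0: C vech(Sigma_u) = 0."""
    K = Sigma_u_tilde.shape[0]
    D_plus = np.linalg.pinv(duplication_matrix(K))
    cov_sigma = 2.0 * D_plus @ np.kron(Sigma_u_tilde, Sigma_u_tilde) @ D_plus.T
    r = C @ vech(Sigma_u_tilde)
    return float(T * (r @ np.linalg.solve(C @ cov_sigma @ C.T, r)))

# Made-up bivariate example: test sigma_21 = 0 with T = 100 observations.
Sigma_toy = np.array([[1.0, 0.5], [0.5, 1.0]])
C_toy = np.array([[0.0, 1.0, 0.0]])       # picks sigma_21 from vech = (s11, s21, s22)'
lam = instantaneous_wald(Sigma_toy, C_toy, T=100)
print(lam)
```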

In our investment/income/consumption example, suppose we wish to test for instantaneous causality between (income, consumption) and investment. Following Proposition 2.3, the null hypothesis of no causality is

$$H_0: \sigma_{21} = \sigma_{31} = 0 \quad \text{or} \quad C\sigma = 0,$$

where $\sigma_{ij}$ is a typical element of $\Sigma_u$ and

$$C = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}.$$

For this hypothesis, the test statistic in (3.6.12) assumes the value $\lambda_W = 5.46$. Alternatively, we may test

$$H_0: p_{21} = p_{31} = 0 \quad \text{or} \quad C\,\mathrm{vech}(P) = 0,$$

where $p_{ij}$ is a typical element of $P$. The corresponding value of the test statistic from (3.6.15) is $\lambda_W = 5.70$. Both tests are based on asymptotic $\chi^2(2)$-distributions and therefore do not reject the null hypothesis of no instantaneous causality at a 5% level. Note that the critical value for a 5% level test is 5.99.

3.6.4 Testing for Multi-Step Causality

In Section 2.3.1, we have also discussed the possibility of extending the information set and considering causality between two variables in a system that includes further variables. Using the same ideas as in the definition of Granger-causality resulted in the definition of $h$-step causality. This concept implies nonlinear restrictions for the VAR coefficients for which the usual application of the Wald principle does not result in a valid test. The following example from Lütkepohl & Burda (1997) illustrates the problem.

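Consider a three-dimensional VAR(1) process:

$$\begin{bmatrix} z_t \\ y_t \\ x_t \end{bmatrix} = \begin{bmatrix} \alpha_{zz} & \alpha_{zy} & \alpha_{zx} \\ \alpha_{yz} & \alpha_{yy} & \alpha_{yx} \\ \alpha_{xz} & \alpha_{xy} & \alpha_{xx} \end{bmatrix} \begin{bmatrix} z_{t-1} \\ y_{t-1} \\ x_{t-1} \end{bmatrix} + \begin{bmatrix} u_{z,t} \\ u_{y,t} \\ u_{x,t} \end{bmatrix}. \tag{3.6.16}$$

From (2.3.24) we know that a test of $\infty$-step noncausality from $y_t$ to $z_t$ ($y_t \nrightarrow_{(\infty)} z_t$) needs to check $h = 2$ restrictions on the VAR coefficient vector. They are of the following nonlinear form:

$$r(\alpha) := \begin{bmatrix} R\alpha \\ R\alpha^{(2)} \end{bmatrix} = 0,$$

where

$$R = [\,0\ \ 0\ \ 0\ \ 1\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\,],$$

$\alpha = \mathrm{vec}(A_1)$ and $\alpha^{(2)} = \mathrm{vec}(A_1^2)$, with $A_1$ being the coefficient matrix of the process in (3.6.16). Hence,

$$r(\alpha) = \begin{bmatrix} \alpha_{zy} \\ \alpha_{zz}\alpha_{zy} + \alpha_{zy}\alpha_{yy} + \alpha_{zx}\alpha_{xy} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \tag{3.6.17}$$

Denoting the covariance matrix of the asymptotic distribution of $\sqrt{T}(\hat{\alpha} - \alpha)$ as usual by $\Sigma_{\hat{\alpha}}$ and a consistent estimator by $\hat{\Sigma}_{\hat{\alpha}}$, the Wald statistic for testing these restrictions has the form

$$\lambda_W = T\,r(\hat{\alpha})'\left(\widehat{\frac{\partial r}{\partial\alpha'}}\,\hat{\Sigma}_{\hat{\alpha}}\,\widehat{\frac{\partial r'}{\partial\alpha}}\right)^{-1} r(\hat{\alpha}),$$

where $\widehat{\partial r/\partial\alpha'}$ is an estimator of $\partial r/\partial\alpha'$ (see Appendix C.7). The statistic has an asymptotic $\chi^2(2)$-distribution under the null hypothesis, provided the matrix

$$\frac{\partial r}{\partial\alpha'}\,\Sigma_{\hat{\alpha}}\,\frac{\partial r'}{\partial\alpha}$$

is nonsingular. In the present case, the latter condition is unfortunately not satisfied for all relevant parameter values.

To see this, note that the matrix of first order partial derivatives of the function $r(\alpha)$ is

$$\frac{\partial r}{\partial\alpha'} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ \alpha_{zy} & 0 & 0 & \alpha_{zz} + \alpha_{yy} & \alpha_{zy} & \alpha_{zx} & \alpha_{xy} & 0 & 0 \end{bmatrix}.$$

The restrictions (3.6.17) are satisfied if

$$\alpha_{zy} = \alpha_{zx} = 0, \quad \alpha_{xy} \ne 0, \tag{3.6.18}$$

or

$$\alpha_{zy} = \alpha_{xy} = 0, \quad \alpha_{zx} \ne 0, \tag{3.6.19}$$

or

$$\alpha_{zy} = \alpha_{zx} = \alpha_{xy} = 0. \tag{3.6.20}$$

The matrix of partial derivatives has full row rank 2 if (3.6.18) or (3.6.19) holds, whereas its second row is a multiple of the first one, so that the rank is only 1, if (3.6.20) holds. Hence, the standard Wald statistic will not have its asymptotic $\chi^2(2)$-distribution under the null hypothesis $r(\alpha) = 0$ if (3.6.20) holds.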
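The rank problem can be made concrete with a small Python sketch that evaluates the matrix of partial derivatives for the VAR(1) example at two parameter points. The function name and the numerical coefficient values are made up for illustration; only the zero pattern matters.

```python
import numpy as np

def restriction_jacobian(A1):
    """d r / d alpha' for r(alpha) = (a_zy, a_zz a_zy + a_zy a_yy + a_zx a_xy)', alpha = vec(A1)."""
    a_zz, a_zy, a_zx = A1[0]            # first row of A1 (z equation)
    a_yy = A1[1, 1]
    a_xy = A1[2, 1]
    return np.array([
        [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [a_zy, 0.0, 0.0, a_zz + a_yy, a_zy, a_zx, a_xy, 0.0, 0.0],
    ])

# Case (3.6.18): a_zy = a_zx = 0 but a_xy != 0.
A_case18 = np.array([[0.5, 0.0, 0.0],
                     [0.2, 0.4, 0.1],
                     [0.1, 0.3, 0.2]])
# Case (3.6.20): a_zy = a_zx = a_xy = 0.
A_case20 = np.array([[0.5, 0.0, 0.0],
                     [0.2, 0.4, 0.1],
                     [0.1, 0.0, 0.2]])

rank18 = np.linalg.matrix_rank(restriction_jacobian(A_case18))
rank20 = np.linalg.matrix_rank(restriction_jacobian(A_case20))
print(rank18, rank20)
```

With a rank of 1 the $(2 \times 2)$ matrix $(\partial r/\partial\alpha')\Sigma_{\hat{\alpha}}(\partial r'/\partial\alpha)$ is singular for any $\Sigma_{\hat{\alpha}}$, which is the situation described above.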

Lütkepohl & Burda (1997) discussed a possibility to circumvent the problem by simply drawing a random variable from a normal distribution and adding it to the second restriction. Thereby a nonsingular distribution of the modified restriction vector is obtained and a Wald type statistic can be constructed for this vector.

More generally, Lütkepohl & Burda (1997) proposed the following approach for testing the null hypothesis that the $K_y$-dimensional vector $y_t$ is not $h$-step causal for the $K_z$-dimensional vector $z_t$ ($y_t \nrightarrow_{(h)} z_t$) if additional $K_x$ variables $x_t$ are present in the system of interest. Using the notation from Section 2.3.1, that is, $\mathbf{A}$ is defined as in the VAR(1) representation (2.1.8), $J := [I_K : 0 : \cdots : 0]$ is a $(K \times Kp)$ matrix, $A^{(j)} := J\mathbf{A}^j$, and $\alpha^{(j)} := \mathrm{vec}(A^{(j)})$, the hypotheses of interest can be stated as

$$H_0: (I_h \otimes R)a^{(h)} = 0 \quad \text{against} \quad H_1: (I_h \otimes R)a^{(h)} \ne 0, \tag{3.6.21}$$

where $R$ is a $(pK_zK_y \times pK^2)$ matrix, as defined in (2.3.23), and

$$a^{(h)} := \begin{bmatrix} \alpha^{(1)} \\ \vdots \\ \alpha^{(h)} \end{bmatrix}.$$

Let $\hat{a}^{(h)}$ be the estimator corresponding to $a^{(h)}$ based on the multivariate LS estimator $\hat{\alpha}$ of $\alpha$. Furthermore, we denote by $\mathrm{diag}(D)$ a diagonal matrix which has the diagonal elements of the square matrix $D$ on its main diagonal and define the $(hpK_zK_y \times hpK_zK_y)$ matrix

$$\hat{\Sigma}_w(h) := \begin{bmatrix} 0 & 0 \\ 0 & I_{h-1} \otimes \mathrm{diag}(R\hat{\Sigma}_{\hat{\alpha}}R') \end{bmatrix}.$$

Moreover, we define a random vector $w_\lambda \sim \mathcal{N}(0, \lambda\hat{\Sigma}_w(h))$ which is drawn independently of $\hat{\alpha}$. Here $\lambda > 0$ is some fixed real number. Lütkepohl & Burda (1997) defined the following modified Wald statistic for testing the pair of hypotheses in (3.6.21):

$$\lambda_W^{\mathrm{mod}} := T\left((I_h \otimes R)\hat{a}^{(h)} + \frac{w_\lambda}{\sqrt{T}}\right)'\left[(I_h \otimes R)\hat{\Sigma}_{\hat{a}}(h)(I_h \otimes R') + \lambda\hat{\Sigma}_w(h)\right]^{-1}\left((I_h \otimes R)\hat{a}^{(h)} + \frac{w_\lambda}{\sqrt{T}}\right).$$

Here $\hat{\Sigma}_{\hat{a}}(h)$ is a consistent estimator of the asymptotic covariance matrix of $\sqrt{T}(\hat{a}^{(h)} - a^{(h)})$. It can be shown that

$$\lambda_W^{\mathrm{mod}} \xrightarrow{d} \chi^2(hpK_zK_y)$$

under $H_0$. Notice that there is no need to add anything to the first $pK_zK_y$ components of $(I_h \otimes R)\hat{a}^{(h)}$ because they are equal to $R\hat{\alpha}$ which has a nonsingular asymptotic distribution.

Clearly, adding some random term to $\hat{a}^{(h)}$ reduces the efficiency of the procedure and is likely to result in a loss in power of the test relative to a procedure which does not use this device. In particular, if the noise term is substantial in relation to the estimated variance, there may be some loss in power. Therefore, the amount of noise (the variance of the noise) is linked to the variance of the estimator through $\hat{\Sigma}_w(h)$. Moreover, the quantity $\lambda$ may be chosen close to zero. Thereby the loss in efficiency can be made arbitrarily small.

There are in fact also other possibilities to avoid the problems related to the Wald test. One way to get around it is to impose zero restrictions directly on the VAR coefficients prior to analyzing multi-step causality. The relevant subset models will be discussed in Chapter 5.

3.7 The Asymptotic Distributions of Impulse Responses and Forecast Error Variance Decompositions

3.7.1  The  Main  Results 

In Chapter 2, Section 2.3.2, we have seen that the coefficients of the MA representations

$$
y_t = \mu + \sum_{i=0}^{\infty} \Phi_i u_{t-i}, \qquad \Phi_0 = I_K, \tag{3.7.1}
$$

and

$$
y_t = \mu + \sum_{i=0}^{\infty} \Theta_i w_{t-i} \tag{3.7.2}
$$

are sometimes interpreted as impulse responses or dynamic multipliers of the system of variables $y_t$. Here $\mu = E(y_t)$, the $\Theta_i = \Phi_i P$, $w_t = P^{-1}u_t$, and $P$ is the lower triangular Choleski decomposition of $\Sigma_u$ such that $\Sigma_u = PP'$. Hence, $\Sigma_w = E(w_t w_t') = I_K$. In this section, we will assume that the $\Phi_i$'s and $\Theta_i$'s are unknown and they are computed from the estimated VAR coefficients and error covariance matrix. We will derive the asymptotic distributions of the resulting estimated $\Phi_i$'s and $\Theta_i$'s. In these derivations, we will not need the existence of MA representations (3.7.1) and (3.7.2). We will just assume that the $\Phi_i$'s are obtained from given coefficient matrices $A_1, \dots, A_p$ by recursions

$$
\Phi_i = \sum_{j=1}^{i} \Phi_{i-j} A_j, \qquad i = 1, 2, \dots,
$$

starting with $\Phi_0 = I_K$ and setting $A_j = 0$ for $j > p$. Furthermore, the $\Theta_i$'s are obtained from $A_1, \dots, A_p$, and $\Sigma_u$ as $\Theta_i = \Phi_i P$, where $P$ is as specified in the foregoing. In addition, the asymptotic distributions of the corresponding accumulated responses


$$
\Psi_n = \sum_{i=0}^{n} \Phi_i, \qquad
\Psi_\infty = \sum_{i=0}^{\infty} \Phi_i = (I_K - A_1 - \cdots - A_p)^{-1} \quad \text{(if it exists)},
$$

$$
\Xi_n = \sum_{i=0}^{n} \Theta_i, \qquad
\Xi_\infty = \sum_{i=0}^{\infty} \Theta_i = (I_K - A_1 - \cdots - A_p)^{-1}P \quad \text{(if it exists)},
$$

and the forecast error variance components,

$$
\omega_{jk,h} = \sum_{i=0}^{h-1} (e_j'\Theta_i e_k)^2 / \mathrm{MSE}_j(h), \tag{3.7.3}
$$

will be given. Here $e_k$ is the $k$-th column of $I_K$ and

$$
\mathrm{MSE}_j(h) = \sum_{i=0}^{h-1} e_j'\Phi_i\Sigma_u\Phi_i'e_j
$$

is the $j$-th diagonal element of the MSE matrix $\Sigma_y(h)$ of an $h$-step forecast (see Chapter 2, Section 2.2.2).
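
The recursions and accumulated responses defined above can be sketched numerically as follows. This is a minimal illustration, not part of the original text: the function name `ma_matrices` and the bivariate VAR(2) coefficient values are assumptions chosen only so that the process is stable.

```python
import numpy as np

def ma_matrices(A_list, n):
    """Return [Phi_0, ..., Phi_n] from Phi_i = sum_{j=1}^{i} Phi_{i-j} A_j."""
    K, p = A_list[0].shape[0], len(A_list)
    Phi = [np.eye(K)]                          # Phi_0 = I_K
    for i in range(1, n + 1):
        Phi_i = np.zeros((K, K))
        for j in range(1, min(i, p) + 1):      # A_j = 0 for j > p
            Phi_i += Phi[i - j] @ A_list[j - 1]
        Phi.append(Phi_i)
    return Phi

# Illustrative (assumed) coefficients of a stable bivariate VAR(2).
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
Sigma_u = np.array([[1.0, 0.3], [0.3, 0.5]])

P = np.linalg.cholesky(Sigma_u)                # lower triangular, Sigma_u = P P'
Phi = ma_matrices([A1, A2], 200)               # forecast error impulse responses
Theta = [Phi_i @ P for Phi_i in Phi]           # orthogonalized impulse responses
Psi_n = np.cumsum(Phi, axis=0)                 # Psi_n[n] = Phi_0 + ... + Phi_n
Psi_inf = np.linalg.inv(np.eye(2) - A1 - A2)   # long-run effects, if the inverse exists
Xi_inf = Psi_inf @ P
```

For a stable process the accumulated responses `Psi_n[n]` approach `Psi_inf` as `n` grows, which gives a quick plausibility check of the closed-form long-run matrix.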

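The derivation of the asymptotic distributions is based on the following result from Appendix C, Proposition C.15(3). Suppose $\beta$ is an $(n \times 1)$ vector of parameters and $\hat{\beta}$ is an estimator such that

$$
\sqrt{T}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\beta}}),
$$

where $T$, as usual, denotes the sample size (time series length) used for estimation. Let $g(\beta) = (g_1(\beta), \dots, g_m(\beta))'$ be a continuously differentiable function with values in the $m$-dimensional Euclidean space and suppose that $\partial g_i/\partial\beta' = (\partial g_i/\partial\beta_j)$ is nonzero at the true vector $\beta$, for $i = 1, \dots, m$. Then,

$$
\sqrt{T}\left[g(\hat{\beta}) - g(\beta)\right] \xrightarrow{d}
\mathcal{N}\left(0, \frac{\partial g}{\partial\beta'}\,\Sigma_{\hat{\beta}}\,\frac{\partial g'}{\partial\beta}\right).
$$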

In writing down the asymptotic distributions formally, we use the notation

$$
\boldsymbol{\alpha} := \mathrm{vec}(A_1, \dots, A_p) \qquad (K^2p \times 1),
$$

$$
\mathbf{A} := \begin{bmatrix}
A_1 & A_2 & \cdots & A_{p-1} & A_p \\
I_K & 0 & \cdots & 0 & 0 \\
0 & I_K & & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & I_K & 0
\end{bmatrix} \qquad (Kp \times Kp),
$$

$$
\boldsymbol{\sigma} := \mathrm{vech}(\Sigma_u) \qquad (\tfrac{1}{2}K(K+1) \times 1)
$$

and the corresponding estimators are furnished with a hat. As before, vec denotes the column stacking operator and vech is the corresponding operator that stacks the elements on and below the main diagonal only. We also use the commutation matrix $K_{mn}$, defined such that, for any $(m \times n)$ matrix $G$, $K_{mn}\mathrm{vec}(G) = \mathrm{vec}(G')$, the $(m^2 \times \tfrac{1}{2}m(m+1))$ duplication matrix $D_m$, defined such that $D_m\mathrm{vech}(F) = \mathrm{vec}(F)$, for any symmetric $(m \times m)$ matrix $F$, and the $(\tfrac{1}{2}m(m+1) \times m^2)$ elimination matrix $L_m$, defined such that, for any $(m \times m)$ matrix $F$, $\mathrm{vech}(F) = L_m\mathrm{vec}(F)$ (see Appendix A.12.2). Furthermore, $J := [I_K : 0 : \cdots : 0]$ is a $(K \times Kp)$ matrix. With this notation, the following proposition from Lütkepohl (1990) can be stated.
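
The commutation, duplication, and elimination matrices are defined only through the identities they satisfy, so a small construction may help make them concrete. The following is a sketch under the column-stacking convention stated above; the function names are assumptions introduced for illustration.

```python
import numpy as np

def vec(M):
    """Stack the columns of M."""
    return M.reshape(-1, order="F")

def vech(M):
    """Stack the elements on and below the main diagonal, column by column."""
    m = M.shape[0]
    return np.concatenate([M[j:, j] for j in range(m)])

def commutation(m, n):
    """K_mn with K_mn vec(G) = vec(G') for any (m x n) matrix G."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0   # G[i, j] sits at j*m+i in vec(G), i*n+j in vec(G')
    return K

def _lower_pairs(m):
    """(row, column) index pairs in the order used by vech."""
    return [(i, j) for j in range(m) for i in range(j, m)]

def elimination(m):
    """L_m with vech(F) = L_m vec(F) for any (m x m) matrix F."""
    pairs = _lower_pairs(m)
    L = np.zeros((len(pairs), m * m))
    for r, (i, j) in enumerate(pairs):
        L[r, j * m + i] = 1.0
    return L

def duplication(m):
    """D_m with D_m vech(F) = vec(F) for any symmetric (m x m) matrix F."""
    pairs = _lower_pairs(m)
    D = np.zeros((m * m, len(pairs)))
    for r, (i, j) in enumerate(pairs):
        D[j * m + i, r] = 1.0               # element (i, j)
        D[i * m + j, r] = 1.0               # its mirror image (j, i)
    return D
```

The Moore-Penrose inverse of the duplication matrix can then be obtained as `np.linalg.pinv(duplication(m))`.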


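Proposition 3.6 (Asymptotic Distributions of Impulse Responses)
Suppose

$$
\sqrt{T}\begin{bmatrix} \hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha} \\ \hat{\boldsymbol{\sigma}} - \boldsymbol{\sigma} \end{bmatrix}
\xrightarrow{d} \mathcal{N}\left(0, \begin{bmatrix} \Sigma_{\hat{\boldsymbol{\alpha}}} & 0 \\ 0 & \Sigma_{\hat{\boldsymbol{\sigma}}} \end{bmatrix}\right). \tag{3.7.4}
$$

Then

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Phi}_i - \Phi_i) \xrightarrow{d} \mathcal{N}(0, G_i\Sigma_{\hat{\boldsymbol{\alpha}}}G_i'), \qquad i = 1, 2, \dots, \tag{3.7.5}
$$

where

$$
G_i := \frac{\partial\,\mathrm{vec}(\Phi_i)}{\partial\boldsymbol{\alpha}'} = \sum_{m=0}^{i-1} J(\mathbf{A}')^{i-1-m} \otimes \Phi_m.
$$

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Psi}_n - \Psi_n) \xrightarrow{d} \mathcal{N}(0, F_n\Sigma_{\hat{\boldsymbol{\alpha}}}F_n'), \qquad n = 1, 2, \dots, \tag{3.7.6}
$$

where $F_n := G_1 + \cdots + G_n$.

If $(I_K - A_1 - \cdots - A_p)$ is nonsingular,

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Psi}_\infty - \Psi_\infty) \xrightarrow{d} \mathcal{N}(0, F_\infty\Sigma_{\hat{\boldsymbol{\alpha}}}F_\infty'), \tag{3.7.7}
$$

where $F_\infty := \underbrace{(\Psi_\infty', \dots, \Psi_\infty')}_{p \text{ times}} \otimes \Psi_\infty$.

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Theta}_i - \Theta_i) \xrightarrow{d} \mathcal{N}(0, C_i\Sigma_{\hat{\boldsymbol{\alpha}}}C_i' + \bar{C}_i\Sigma_{\hat{\boldsymbol{\sigma}}}\bar{C}_i'), \qquad i = 0, 1, 2, \dots, \tag{3.7.8}
$$

where

$$
C_0 := 0, \quad C_i := (P' \otimes I_K)G_i, \ i = 1, 2, \dots, \qquad \bar{C}_i := (I_K \otimes \Phi_i)H, \ i = 0, 1, \dots,
$$

and

$$
H := \frac{\partial\,\mathrm{vec}(P)}{\partial\boldsymbol{\sigma}'}
= L_K'\{L_K[(I_K \otimes P)K_{KK} + (P \otimes I_K)]L_K'\}^{-1}
= L_K'\{L_K(I_{K^2} + K_{KK})(P \otimes I_K)L_K'\}^{-1}.
$$

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Xi}_n - \Xi_n) \xrightarrow{d} \mathcal{N}(0, B_n\Sigma_{\hat{\boldsymbol{\alpha}}}B_n' + \bar{B}_n\Sigma_{\hat{\boldsymbol{\sigma}}}\bar{B}_n'), \tag{3.7.9}
$$

where $B_n := (P' \otimes I_K)F_n$ and $\bar{B}_n := (I_K \otimes \Psi_n)H$.

If $(I_K - A_1 - \cdots - A_p)$ is nonsingular,

$$
\sqrt{T}\,\mathrm{vec}(\hat{\Xi}_\infty - \Xi_\infty) \xrightarrow{d} \mathcal{N}(0, B_\infty\Sigma_{\hat{\boldsymbol{\alpha}}}B_\infty' + \bar{B}_\infty\Sigma_{\hat{\boldsymbol{\sigma}}}\bar{B}_\infty'), \tag{3.7.10}
$$

where $B_\infty := (P' \otimes I_K)F_\infty$ and $\bar{B}_\infty := (I_K \otimes \Psi_\infty)H$.

Finally,

$$
\sqrt{T}(\hat{\omega}_{jk,h} - \omega_{jk,h}) \xrightarrow{d} \mathcal{N}(0, d_{jk,h}\Sigma_{\hat{\boldsymbol{\alpha}}}d_{jk,h}' + \bar{d}_{jk,h}\Sigma_{\hat{\boldsymbol{\sigma}}}\bar{d}_{jk,h}'),
\qquad j, k = 1, \dots, K, \quad h = 1, 2, \dots, \tag{3.7.11}
$$

where

$$
d_{jk,h} := \frac{2}{\mathrm{MSE}_j(h)^2}\sum_{i=0}^{h-1}\Big[\mathrm{MSE}_j(h)(e_j'\Phi_iPe_k)(e_k'P' \otimes e_j')G_i
- (e_j'\Phi_iPe_k)^2\sum_{m=0}^{h-1}(e_j'\Phi_m\Sigma_u \otimes e_j')G_m\Big]
$$

with $G_0 := 0$ and

$$
\bar{d}_{jk,h} := \sum_{i=0}^{h-1}\Big[2\,\mathrm{MSE}_j(h)(e_j'\Phi_iPe_k)(e_k' \otimes e_j'\Phi_i)H
- (e_j'\Phi_iPe_k)^2\sum_{m=0}^{h-1}(e_j'\Phi_m \otimes e_j'\Phi_m)D_K\Big]\Big/\mathrm{MSE}_j(h)^2.
$$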

In  the  next  subsection,  the  proof  of  the  proposition  is  indicated.  Some 
remarks  are  worthwhile  now. 

Remark 1 In the proposition, some matrices of partial derivatives may be zero. For instance, if a VAR(1) model is fitted although the true order is zero, that is, $y_t$ is white noise, then

$$
G_2 = J\mathbf{A}' \otimes I_K + J \otimes \Phi_1 = 0
$$

because $\mathbf{A} = A_1 = 0$ and $\Phi_1 = A_1 = 0$. Hence, a degenerate asymptotic distribution with zero covariance matrix is obtained for $\sqrt{T}\,\mathrm{vec}(\hat{\Phi}_2 - \Phi_2)$. As explained in Appendix B, we call such a distribution also multivariate normal. Otherwise it would be necessary to distinguish between cases with zero and nonzero partial derivatives or we have to assume that all partial derivatives are such that the covariance matrices have no zeros on the diagonal. Note that estimators of the covariance matrices obtained by replacing unknown quantities by their usual estimators may be problematic when the asymptotic distribution is degenerate. In that case, the usual t-ratios and confidence intervals may not be appropriate.

To illustrate the potential problems resulting from a degenerate asymptotic distribution, we follow Benkwitz, Lütkepohl & Neumann (2000) and consider a univariate AR(1) process $y_t = \alpha y_{t-1} + u_t$. In this case, $\phi_i = \alpha^i$. Suppose that $\hat{\alpha}$ is an estimator of $\alpha$ satisfying $\sqrt{T}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathcal{N}(0, \sigma_{\hat{\alpha}}^2)$ with $\sigma_{\hat{\alpha}}^2 \neq 0$. For instance, $\hat{\alpha}$ may be the LS estimator of $\alpha$. Then

$$
\sqrt{T}(\hat{\alpha}^2 - \alpha^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{\hat{\alpha}^2}^2)
$$

with $\sigma_{\hat{\alpha}^2}^2 = 4\alpha^2\sigma_{\hat{\alpha}}^2$. This quantity is, of course, zero if $\alpha = 0$. In the latter case, $\sqrt{T}\hat{\alpha}/\sigma_{\hat{\alpha}}$ has an asymptotic standard normal distribution and, hence, $T\hat{\alpha}^2/\sigma_{\hat{\alpha}}^2$ has an asymptotic $\chi^2(1)$-distribution. Thus, it is clear that in this case $\sqrt{T}\hat{\alpha}^2$ is asymptotically degenerate.

Because the estimated $\hat{\sigma}_{\hat{\alpha}^2}^2 = 4\hat{\alpha}^2\hat{\sigma}_{\hat{\alpha}}^2$ obtained by replacing $\alpha$ and $\sigma_{\hat{\alpha}}^2$ by their usual LS estimators is nonzero almost surely, it is tempting to use the quantity $\sqrt{T}(\hat{\alpha}^2 - \alpha^2)/2\hat{\alpha}\hat{\sigma}_{\hat{\alpha}}$ for constructing a confidence interval, say, for $\alpha^2$. However, for $\alpha = 0$, the t-ratio becomes $\sqrt{T}\hat{\alpha}/2\hat{\sigma}_{\hat{\alpha}}$ which converges to $\mathcal{N}(0, 1/4)$ asymptotically, because $\sqrt{T}\hat{\alpha}/\hat{\sigma}_{\hat{\alpha}} \xrightarrow{d} \mathcal{N}(0, 1)$. A confidence interval constructed on the basis of the asymptotic standard normal distribution would therefore be a conservative one. In other words, asymptotic inference which ignores the possible singularity in the asymptotic distribution of the impulse responses may be misleading (see Benkwitz et al. (2000) for further discussion). ■
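
The AR(1) illustration in Remark 1 can be checked by a small simulation. The sketch below is not from the original text: the sample size, number of replications, seed, and the use of Gaussian white noise are all assumptions. It simulates white noise (so that $\alpha = 0$), computes the LS estimator, and forms the t-ratio $\sqrt{T}\hat{\alpha}/2\hat{\sigma}_{\hat{\alpha}}$ discussed there.

```python
import numpy as np

rng = np.random.default_rng(0)
T, reps = 400, 4000                      # assumed sample size and number of replications
t_ratios = np.empty(reps)

for r in range(reps):
    y = rng.standard_normal(T + 1)       # alpha = 0, so y_t is white noise
    y_lag, y_cur = y[:-1], y[1:]
    a_hat = (y_lag @ y_cur) / (y_lag @ y_lag)          # LS estimator of alpha
    resid = y_cur - a_hat * y_lag
    # Estimated asymptotic variance of sqrt(T) * (a_hat - alpha).
    s2_a = (resid @ resid / T) / (y_lag @ y_lag / T)
    # t-ratio for alpha^2 under alpha = 0: sqrt(T) a_hat^2 / (2 a_hat s_a).
    t_ratios[r] = np.sqrt(T) * a_hat / (2.0 * np.sqrt(s2_a))

var_t = t_ratios.var()                                  # should be near 1/4, not 1
coverage = np.mean(np.abs(t_ratios) <= 1.96)            # nominal 95% interval
```

Under these assumptions the simulated variance of the t-ratio is close to 1/4 and the nominal 95% interval covers far more often than 95% of the time, which is the conservativeness described in the remark.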

Remark 2 In the proposition, it is not explicitly assumed that $y_t$ is stable. While the stability condition is partly introduced in (3.7.7) and (3.7.10) by requiring that $(I_K - A_1 - \cdots - A_p)$ be nonsingular so that

$$
\det(I_K - A_1z - \cdots - A_pz^p) \neq 0 \quad \text{for } z = 1,
$$

it is not needed for the other results to hold. The crucial condition is the asymptotic distribution of the process parameters in (3.7.4). Although we have used the stationarity and stability assumptions in Sections 3.2-3.4 in
order  to  derive  the  asymptotic  distribution  of  the  process  parameters,  we  will 
see  in  later  chapters  that  asymptotic  normality  is  also  obtained  for  certain 
nonstationary,  unstable  processes.  Therefore,  at  least  parts  of  Proposition  3.6 
will  be  useful  in  a  nonstationary  environment.  ■ 

Remark  3  The  block-diagonal  structure  of  the  covariance  matrix  of  the 
asymptotic  distribution  in  (3.7.4)  is  in  no  way  essential  for  the  asymptotic 
normality  of  the  impulse  responses.  In  fact,  the  asymptotic  distributions  in 
(3.7.5)-(3.7.7)  remain  unchanged  if  the  asymptotic  covariance  matrix  of  the 
parameter  estimators  is  not  block-diagonal.  On  the  other  hand,  without  the 
block-diagonal structure, the simple additive structure of the asymptotic covariance
matrices in (3.7.8)-(3.7.11) is lost. Although these asymptotic distributions
are easily generalizable to the case of a general asymptotic covariance
matrix  of  the  VAR  coefficients  in  (3.7.4),  we  have  not  stated  the  more  general 
result  here  because  it  is  not  needed  in  subsequent  chapters  of  this  text.  ■ 

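Remark 4 Under the conditions of Proposition 3.4, the covariance matrix of the asymptotic distribution of the parameters has precisely the block-diagonal structure assumed in (3.7.4) with

$$
\Sigma_{\hat{\boldsymbol{\alpha}}} = \Gamma_Y(0)^{-1} \otimes \Sigma_u
$$

and

$$
\Sigma_{\hat{\boldsymbol{\sigma}}} = 2D_K^+(\Sigma_u \otimes \Sigma_u)D_K^{+\prime},
$$

where $D_K^+ = (D_K'D_K)^{-1}D_K'$ is the Moore-Penrose inverse of the duplication matrix $D_K$. Using these expressions in the proposition, some simplifications of the covariance matrices can be obtained. For instance, the covariance matrix in (3.7.5) becomes

$$
\begin{aligned}
G_i\Sigma_{\hat{\boldsymbol{\alpha}}}G_i'
&= \left[\sum_{m=0}^{i-1} J(\mathbf{A}')^{i-1-m} \otimes \Phi_m\right]
\left(\Gamma_Y(0)^{-1} \otimes \Sigma_u\right)
\left[\sum_{n=0}^{i-1} \mathbf{A}^{i-1-n}J' \otimes \Phi_n'\right] \\
&= \sum_{m=0}^{i-1}\sum_{n=0}^{i-1}\left[J(\mathbf{A}')^{i-1-m}\Gamma_Y(0)^{-1}\mathbf{A}^{i-1-n}J'\right] \otimes \left(\Phi_m\Sigma_u\Phi_n'\right),
\end{aligned}
$$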

which  is  computationally  convenient  because  all  matrices  involved  are  of  a 
relatively  small  size.  The  advantage  of  the  general  formulation  is  that  it  can 
be  used  with  other  matrices  as  well.  We  will  see  examples  in  subsequent 
chapters.  ■ 


Remark  5  In  practice,  to  use  the  asymptotic  distributions  for  inference,  the 
unknown  quantities  in  the  covariance  matrices  in  Proposition  3.6  may  be 
replaced by their usual estimators given in Sections 3.2-3.4 for the case of a
stationary,  stable  process  yt  (see,  however,  Remark  1).  ■ 
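
As a rough sketch of how the plug-in standard errors for the $\hat{\Phi}_i$ might be computed from (3.7.5), the following code builds the companion matrix $\mathbf{A}$, the selection matrix $J$, and $G_i$. The function names and all numerical inputs are assumptions for illustration; in particular `Gamma_inv` is only a stand-in for an estimate of $\Gamma_Y(0)^{-1}$.

```python
import numpy as np

def companion(A_list):
    """(Kp x Kp) companion matrix with [A_1, ..., A_p] in the first K rows."""
    K, p = A_list[0].shape[0], len(A_list)
    A = np.zeros((K * p, K * p))
    A[:K, :] = np.hstack(A_list)
    A[K:, :-K] = np.eye(K * (p - 1))
    return A

def G_matrix(A_list, i):
    """G_i = sum_{m=0}^{i-1} J (A')^{i-1-m} kron Phi_m, where Phi_m = J A^m J'."""
    K, p = A_list[0].shape[0], len(A_list)
    A = companion(A_list)
    J = np.hstack([np.eye(K), np.zeros((K, K * (p - 1)))])
    G = np.zeros((K * K, K * K * p))
    for m in range(i):
        Phi_m = J @ np.linalg.matrix_power(A, m) @ J.T
        G += np.kron(J @ np.linalg.matrix_power(A.T, i - 1 - m), Phi_m)
    return G

# Hypothetical inputs, for illustration only.
A_list = [np.array([[0.5, 0.1], [0.4, 0.5]]), np.array([[0.0, 0.0], [0.25, 0.0]])]
Sigma_u = np.array([[1.0, 0.3], [0.3, 0.5]])
Gamma_inv = np.eye(4)                     # stand-in for an estimate of Gamma_Y(0)^{-1}
T = 100                                   # assumed sample size

Sigma_alpha = np.kron(Gamma_inv, Sigma_u)
G2 = G_matrix(A_list, 2)
cov_Phi2 = G2 @ Sigma_alpha @ G2.T / T
se_Phi2 = np.sqrt(np.diag(cov_Phi2)).reshape(2, 2, order="F")   # same layout as Phi_2
```

The reshape uses column-major order because the covariance matrix refers to $\mathrm{vec}(\hat{\Phi}_2)$, which stacks columns.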

Remark 6 Summing the forecast error variance components over $k$,

$$
\sum_{k=1}^{K} \omega_{jk,h} = \sum_{k=1}^{K} \hat{\omega}_{jk,h} = 1
$$

for  each  j  and  h.  These  restrictions  are  not  taken  into  account  in  the  derivation 
of  the  asymptotic  distributions  in  (3.7.11).  It  is  easily  checked,  however,  that 
for  dimension  K  =  1  the  standard  errors  obtained  from  Proposition  3.6  are 
zero  as  they  should  be,  because  all  forecast  error  variance  components  are  1 
in that case. A problem in this context is that the asymptotic distribution of $\hat{\omega}_{jk,h}$ cannot be used in the usual way for tests of significance and setting up confidence intervals if $\omega_{jk,h} = 0$. In that case, from the definitions of $d_{jk,h}$ and $\bar{d}_{jk,h}$, the variance of the asymptotic distribution is easily seen to be zero and, hence, estimating this quantity by replacing unknown parameters by their usual estimators may lead to t-ratios that are not standard normal asymptotically and, hence, cannot be used in the usual way for inference (see Remark 1). This state of affairs is unfortunate from a practical point of view because testing the significance of forecast error variance components is of particular interest in practice. Note, however, that

$$
\omega_{jk,h} = 0 \iff \theta_{jk,i} = 0 \quad \text{for } i = 0, \dots, h-1.
$$

A test of the latter hypothesis may be possible. ■
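
The adding-up restriction in Remark 6 is easy to verify numerically. The following sketch computes the forecast error variance components of (3.7.3) for given $\Phi_i$ and $P$; the function name `fevd` and the VAR(1) inputs are assumptions made for illustration.

```python
import numpy as np

def fevd(Phi, P, h):
    """Return the (K x K) matrix of omega_{jk,h} as defined in (3.7.3).

    Phi is a sequence [Phi_0, Phi_1, ...] with at least h elements and
    P is the lower triangular Choleski factor of Sigma_u.
    """
    Theta = np.array([Phi_i @ P for Phi_i in Phi[:h]])   # Theta_i = Phi_i P
    contrib = (Theta ** 2).sum(axis=0)                   # sum_i (e_j' Theta_i e_k)^2
    mse = contrib.sum(axis=1, keepdims=True)             # MSE_j(h) = sum over k of the above
    return contrib / mse

# Hypothetical stable VAR(1), for which Phi_i = A_1^i.
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
P = np.linalg.cholesky(np.array([[1.0, 0.3], [0.3, 0.5]]))
Phi = [np.linalg.matrix_power(A1, i) for i in range(10)]

omega = fevd(Phi, P, h=4)
row_sums = omega.sum(axis=1)     # each row sums to one
```

The denominator uses the identity $\mathrm{MSE}_j(h) = \sum_i e_j'\Theta_i\Theta_i'e_j$, which follows from $\Sigma_u = PP'$.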

Remark 7 Joint confidence regions and test statistics for testing hypotheses that involve several of the response coefficients can be obtained from Proposition 3.6 in the usual way. However, it has to be taken into account that, for instance, the elements of $\hat{\Phi}_i$ and $\hat{\Phi}_j$ will not be independent asymptotically. If elements from two or more MA matrices are involved the joint distribution of all the matrices must be determined. This distribution can be derived easily from the results given in the proposition. For instance, the covariance matrix of the joint asymptotic distribution of $\mathrm{vec}(\hat{\Phi}_i, \hat{\Phi}_j)$ is

$$
\frac{\partial\,\mathrm{vec}(\Phi_i, \Phi_j)}{\partial\boldsymbol{\alpha}'}\,\Sigma_{\hat{\boldsymbol{\alpha}}}\,\frac{\partial\,\mathrm{vec}(\Phi_i, \Phi_j)'}{\partial\boldsymbol{\alpha}},
$$

where

$$
\frac{\partial\,\mathrm{vec}(\Phi_i, \Phi_j)}{\partial\boldsymbol{\alpha}'} =
\begin{bmatrix}
\dfrac{\partial\,\mathrm{vec}(\Phi_i)}{\partial\boldsymbol{\alpha}'} \\[2ex]
\dfrac{\partial\,\mathrm{vec}(\Phi_j)}{\partial\boldsymbol{\alpha}'}
\end{bmatrix}
$$

etc.  We  have  chosen  to  state  the  proposition  for  individual  MA  coefficient  ma¬ 
trices  because  thereby  all  required  matrices  have  relatively  small  dimensions 
and,  hence,  are  easy  to  compute.  ■ 

Remark 8 Denoting the $jk$-th elements of $\Phi_i$ and $\Theta_i$ by $\phi_{jk,i}$ and $\theta_{jk,i}$, respectively, hypotheses of obvious interest, for $j \neq k$, are

$$
H_0: \phi_{jk,i} = 0 \quad \text{for } i = 1, 2, \dots \tag{3.7.12}
$$

and

$$
H_0: \theta_{jk,i} = 0 \quad \text{for } i = 0, 1, 2, \dots \tag{3.7.13}
$$

because they can be interpreted as hypotheses on noncausality from variable $k$ to variable $j$, that is, an impulse in variable $k$ does not induce any response of variable $j$. From Chapter 2, Propositions 2.4 and 2.5, we know that (3.7.12) is equivalent to

$$
H_0: \phi_{jk,i} = 0 \quad \text{for } i = 1, 2, \dots, p(K-1) \tag{3.7.14}
$$

and (3.7.13) is equivalent to

$$
H_0: \theta_{jk,i} = 0 \quad \text{for } i = 0, 1, \dots, p(K-1). \tag{3.7.15}
$$

Using Bonferroni’s inequality (see Chapter 2, Section 2.2.3), a test of (3.7.14) with significance level at most $100\gamma\%$ is obtained by rejecting $H_0$ if

$$
\left|\sqrt{T}\hat{\phi}_{jk,i}/\hat{\sigma}_{\phi_{jk}}(i)\right| > z_{(\gamma/2p(K-1))} \tag{3.7.16}
$$

for at least one $i \in \{1, 2, \dots, p(K-1)\}$. Here $z_{(\gamma)}$ is the upper $100\gamma$ percentage point of the standard normal distribution and $\hat{\sigma}_{\phi_{jk}}(i)$ is an estimate of the asymptotic standard deviation $\sigma_{\phi_{jk}}(i)$ of $\sqrt{T}\hat{\phi}_{jk,i}$ obtained via Proposition 3.6. In order to obtain an asymptotic standard normal distribution of the t-ratio $\sqrt{T}\hat{\phi}_{jk,i}/\hat{\sigma}_{\phi_{jk}}(i)$, the variance $\sigma_{\phi_{jk}}^2(i)$ must be nonzero, however.

A test of (3.7.15) with significance level at most $100\gamma\%$ is obtained by rejecting $H_0$ if

$$
\left|\sqrt{T}\hat{\theta}_{jk,i}/\hat{\sigma}_{\theta_{jk}}(i)\right|
\begin{cases}
> z_{(\gamma/2(pK-p+1))} & \text{for at least one } i \in \{0, 1, 2, \dots, p(K-1)\} \text{ if } j > k, \\
> z_{(\gamma/2(pK-p))} & \text{for at least one } i \in \{1, 2, \dots, p(K-1)\} \text{ if } j < k.
\end{cases} \tag{3.7.17}
$$

Here $\hat{\sigma}_{\theta_{jk}}(i)$ is a consistent estimator of the standard deviation of the asymptotic distribution of $\sqrt{T}\hat{\theta}_{jk,i}$ obtained from Proposition 3.6 and that standard deviation is assumed to be nonzero.

A test based on Bonferroni’s principle may have quite low power because the actual significance level may be much smaller than the given upper bound. Therefore a test based on some $\chi^2$- or $F$-statistic would be preferable. Unfortunately, such tests are not easily available for the present situation. The problem is similar to the one discussed in Section 3.6.4 in the context of testing for multi-step causality. For more discussion of this point see also Lütkepohl (1990) and for a different approach of representing the uncertainty in estimated impulse responses see Sims & Zha (1999). ■


3.7.2  Proof  of  Proposition  3.6 


The proof of Proposition 3.6 is a straightforward application of the matrix differentiation rules given in Appendix A.13. It is sketched here for completeness and because it is spread out over a number of publications in the literature. Readers mainly interested in applying the proposition may skip this section without loss of continuity.

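To prove (3.7.5), note that $\Phi_i = J\mathbf{A}^iJ'$ (see Chapter 2, Section 2.1.2) and apply Rule (8) of Appendix A.13. The expression for $F_n$ in (3.7.6) follows because

$$
\frac{\partial\,\mathrm{vec}(\Psi_n)}{\partial\boldsymbol{\alpha}'} = \sum_{i=1}^{n}\frac{\partial\,\mathrm{vec}(\Phi_i)}{\partial\boldsymbol{\alpha}'}
$$

and

$$
\frac{\partial\,\mathrm{vec}(\Psi_\infty)}{\partial\boldsymbol{\alpha}'}
= \frac{\partial\,\mathrm{vec}(\Psi_\infty)}{\partial\,\mathrm{vec}(\Psi_\infty^{-1})'}\,\frac{\partial\,\mathrm{vec}(\Psi_\infty^{-1})}{\partial\boldsymbol{\alpha}'}
= -(\Psi_\infty' \otimes \Psi_\infty)\,\frac{\partial\,\mathrm{vec}(I_K - A_1 - \cdots - A_p)}{\partial\boldsymbol{\alpha}'}.
$$

Furthermore,

$$
C_i = \frac{\partial\,\mathrm{vec}(\Theta_i)}{\partial\boldsymbol{\alpha}'}
= \frac{\partial\,\mathrm{vec}(\Phi_iP)}{\partial\boldsymbol{\alpha}'}
= (P' \otimes I_K)\frac{\partial\,\mathrm{vec}(\Phi_i)}{\partial\boldsymbol{\alpha}'}
$$

and

$$
\bar{C}_i = \frac{\partial\,\mathrm{vec}(\Theta_i)}{\partial\boldsymbol{\sigma}'} = (I_K \otimes \Phi_i)\frac{\partial\,\mathrm{vec}(P)}{\partial\boldsymbol{\sigma}'},
$$

where

$$
\frac{\partial\,\mathrm{vec}(P)}{\partial\boldsymbol{\sigma}'} = L_K'\frac{\partial\,\mathrm{vech}(P)}{\partial\boldsymbol{\sigma}'} = H
$$

follows from Appendix A.13, Rule (10). The matrices $B_n$, $\bar{B}_n$, $B_\infty$, and $\bar{B}_\infty$ are obtained in a similar manner, using the relations $\Xi_n = \Psi_nP$ and $\Xi_\infty = \Psi_\infty P$. Finally, in (3.7.11),

$$
d_{jk,h} = \frac{\partial\omega_{jk,h}}{\partial\boldsymbol{\alpha}'}
= \sum_{i=0}^{h-1}\left[2(e_j'\Phi_iPe_k)(e_k'P' \otimes e_j')\frac{\partial\,\mathrm{vec}(\Phi_i)}{\partial\boldsymbol{\alpha}'}\mathrm{MSE}_j(h)
- (e_j'\Phi_iPe_k)^2\frac{\partial\,\mathrm{MSE}_j(h)}{\partial\boldsymbol{\alpha}'}\right]\Big/\mathrm{MSE}_j(h)^2,
$$

$$
\begin{aligned}
\frac{\partial\,\mathrm{MSE}_j(h)}{\partial\boldsymbol{\alpha}'}
&= \sum_{m=0}^{h-1}\left[(e_j'\Phi_m\Sigma_u \otimes e_j')\frac{\partial\,\mathrm{vec}(\Phi_m)}{\partial\boldsymbol{\alpha}'}
+ (e_j' \otimes e_j'\Phi_m\Sigma_u)\frac{\partial\,\mathrm{vec}(\Phi_m')}{\partial\boldsymbol{\alpha}'}\right] \\
&= \sum_{m=0}^{h-1}\left[(e_j'\Phi_m\Sigma_u \otimes e_j') + (e_j' \otimes e_j'\Phi_m\Sigma_u)K_{KK}\right]G_m \\
&= \sum_{m=0}^{h-1}\left[(e_j'\Phi_m\Sigma_u \otimes e_j') + K_{11}(e_j'\Phi_m\Sigma_u \otimes e_j')\right]G_m \\
&= 2\sum_{m=0}^{h-1}(e_j'\Phi_m\Sigma_u \otimes e_j')G_m
\end{aligned}
$$

(see Appendix A.12.2, Rule (23)),

$$
\bar{d}_{jk,h} = \frac{\partial\omega_{jk,h}}{\partial\boldsymbol{\sigma}'}
= \sum_{i=0}^{h-1}\left[2(e_j'\Phi_iPe_k)(e_k' \otimes e_j'\Phi_i)\frac{\partial\,\mathrm{vec}(P)}{\partial\boldsymbol{\sigma}'}\mathrm{MSE}_j(h)
- (e_j'\Phi_iPe_k)^2\frac{\partial\,\mathrm{MSE}_j(h)}{\partial\boldsymbol{\sigma}'}\right]\Big/\mathrm{MSE}_j(h)^2,
$$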

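and

$$
\frac{\partial\,\mathrm{MSE}_j(h)}{\partial\boldsymbol{\sigma}'}
= \sum_{m=0}^{h-1}(e_j'\Phi_m \otimes e_j'\Phi_m)\frac{\partial\,\mathrm{vec}(\Sigma_u)}{\partial\boldsymbol{\sigma}'}
= \sum_{m=0}^{h-1}(e_j'\Phi_m \otimes e_j'\Phi_m)D_K.
$$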


3.7.3  An  Example 

To illustrate the results of Section 3.7.1, we use again the investment/income/consumption example from Section 3.2.3. Because

$$
\hat{\Phi}_1 = \hat{A}_1 = \begin{bmatrix}
-.320 & .146 & .961 \\
.044 & -.153 & .289 \\
-.002 & .225 & -.264
\end{bmatrix},
$$

the elements of $\hat{\Phi}_1$ must have the same standard errors as the elements of $\hat{A}_1$. Checking the covariance matrix in (3.7.5), it is seen that the asymptotic covariance matrix of $\hat{\Phi}_1$ is indeed the upper left-hand $(K^2 \times K^2)$ block of $\Sigma_{\hat{\boldsymbol{\alpha}}}$ because

$$
G_1 = J \otimes I_K = [I_{K^2} : 0 : \cdots : 0].
$$

Thus, the square roots of the diagonal elements of

$$
\hat{G}_1\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}\hat{G}_1'/T
= \frac{1}{T}\,[I_9 : 0 : \cdots : 0]\left(\hat{\Gamma}_Y(0)^{-1} \otimes \hat{\Sigma}_u\right)
\begin{bmatrix} I_9 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$

are estimates of the asymptotic standard errors of $\hat{\Phi}_1$. Note that here and in the following we use the LS estimators from the standard form of the VAR process (see Section 3.2) and not the mean-adjusted form. Accordingly, the estimate $\hat{\Gamma}_Y(0)^{-1}$ is obtained from $(ZZ'/T)^{-1}$ by deleting the first row and column.

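From (2.1.22) we get

$$
\hat{\Phi}_2 = \hat{\Phi}_1\hat{A}_1 + \hat{A}_2 = \begin{bmatrix}
-.054 & .262 & .416 \\
.029 & .114 & -.088 \\
.045 & .261 & .110
\end{bmatrix}.
$$

To estimate the corresponding standard errors, we note that

$$
G_2 = J\mathbf{A}' \otimes I_K + J \otimes \Phi_1.
$$

Replacing the unknown quantities by the usual estimates gives

$$
\begin{aligned}
\frac{1}{T}\hat{G}_2\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}\hat{G}_2'
= \frac{1}{T}\Big[&J\hat{\mathbf{A}}'\hat{\Gamma}_Y(0)^{-1}\hat{\mathbf{A}}J' \otimes \hat{\Sigma}_u
+ J\hat{\mathbf{A}}'\hat{\Gamma}_Y(0)^{-1}J' \otimes \hat{\Sigma}_u\hat{\Phi}_1' \\
&+ J\hat{\Gamma}_Y(0)^{-1}\hat{\mathbf{A}}J' \otimes \hat{\Phi}_1\hat{\Sigma}_u
+ J\hat{\Gamma}_Y(0)^{-1}J' \otimes \hat{\Phi}_1\hat{\Sigma}_u\hat{\Phi}_1'\Big].
\end{aligned}
$$

The square roots of the diagonal elements of this matrix are estimates of the standard deviations of the elements of $\hat{\Phi}_2$ and so on. Some $\hat{\Phi}_i$ matrices together with estimated standard errors are given in Table 3.3. In Figures 3.4 and 3.5, some impulse responses are depicted graphically along with two-standard error bounds.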


Fig. 3.4. Estimated responses of consumption to a forecast error impulse in income with estimated asymptotic two-standard error bounds.


In Figure 3.4, consumption is seen to increase in response to a unit shock in income. However, under a two-standard error criterion (approximate 95% confidence bounds) only the second response coefficient is significantly different from zero. Of course, the large standard errors of the impulse response coefficients reflect the substantial estimation uncertainty in the VAR coefficient matrices $\hat{A}_1$ and $\hat{A}_2$.

To check the overall significance of the response coefficients of consumption to an income impulse, we may use the procedure described in Remark 8 of Section 3.7.1. That is, we have to check the significance of the first $p(K-1) = 4$ response coefficients. Because one of them is individually significant at an asymptotic 5% level we may reject the null hypothesis of no response of consumption to income impulses at a significance level not greater than 4 × 5% = 20%. Of course, this is not a significance level we are used to in applied work. However, it becomes clear from Table 3.3 that the second response coefficient $\hat{\phi}_{32,2}$ is still significant if the individual significance levels are reduced to 2.5%. Note that the upper 1.25 percentage point of the standard normal distribution is $z_{(0.0125)} = 2.24$. Thus, we may reject the no-response hypothesis at an overall 4 × 2.5% = 10% level which is clearly a more common size for a test in applied work. Still, in this exercise, the data do not reveal strong evidence for the intuitively appealing hypothesis that consumption responds to income impulses. In later chapters, we will see how the coefficients can potentially be estimated with more precision.

Table 3.3. Estimates of impulse responses for the investment/income/consumption system with estimated asymptotic standard errors in parentheses

  i     $\hat{\Phi}_i$                          $\hat{\Psi}_i$

  1    -0.320    0.146    0.961        0.680    0.146    0.961
       (0.125)  (0.562)  (0.657)      (0.125)  (0.562)  (0.657)
        0.044   -0.153    0.289        0.044    0.847    0.289
       (0.032)  (0.143)  (0.167)      (0.032)  (0.143)  (0.167)
       -0.002    0.225   -0.264       -0.002    0.225    0.736
       (0.025)  (0.115)  (0.134)      (0.025)  (0.115)  (0.134)

  2    -0.054    0.262    0.416        0.626    0.408    1.377
       (0.129)  (0.546)  (0.663)      (0.148)  (0.651)  (0.755)
        0.029    0.114   -0.088        0.073    0.961    0.200
       (0.032)  (0.135)  (0.162)      (0.043)  (0.192)  (0.222)
        0.045    0.261    0.110        0.043    0.486    0.846
       (0.026)  (0.108)  (0.131)      (0.033)  (0.144)  (0.167)

  3     0.119    0.353   -0.408        0.745    0.761    0.969
       (0.084)  (0.384)  (0.476)      (0.099)  (0.483)  (0.550)
       -0.009    0.071    0.120        0.064    1.033    0.320
       (0.016)  (0.078)  (0.094)      (0.037)  (0.176)  (0.203)
       -0.001   -0.098    0.091        0.042    0.388    0.937
       (0.017)  (0.078)  (0.102)      (0.033)  (0.156)  (0.183)

  ∞     0                              0.756    0.836    1.295
                                      (0.133)  (0.661)  (0.798)
                                       0.076    1.076    0.344
                                      (0.048)  (0.236)  (0.285)
                                       0.053    0.505    0.964
                                      (0.043)  (0.213)  (0.257)

Fig. 3.5. Estimated responses of investment to a forecast error impulse in consumption with estimated asymptotic two-standard error bounds.
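
The Bonferroni argument used above for the response of consumption to income can be retraced with a few lines of code. This is only a sketch: the helper name `bonferroni_critical_value` is an assumption, and only the three coefficients $\hat{\phi}_{32,i}$, $i = 1, 2, 3$, printed in Table 3.3 are used because the fourth one is not reported there.

```python
from statistics import NormalDist

def bonferroni_critical_value(gamma, n_tests):
    """Upper gamma / (2 * n_tests) point of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - gamma / (2.0 * n_tests))

p, K = 2, 3
n_tests = p * (K - 1)                       # p(K - 1) = 4 coefficients to check
crit = bonferroni_critical_value(0.10, n_tests)

# phi_{32,i} and standard errors for i = 1, 2, 3 as printed in Table 3.3.
# The i = 4 coefficient is not reported in the table and is omitted here.
estimates = [0.225, 0.261, -0.098]
std_errors = [0.115, 0.108, 0.078]

t_ratios = [est / se for est, se in zip(estimates, std_errors)]
reject = any(abs(t) > crit for t in t_ratios)
```

Since the tabulated standard errors already refer to the estimators themselves rather than to $\sqrt{T}$ times the estimators, the t-ratios are simply estimate divided by standard error.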

In Figure 3.5, the responses of investment to consumption impulses are depicted. None of them is significant under a two-standard error criterion. This result is in line with the Granger-causality analysis in Section 3.6. In that section, we did not find evidence for Granger-causality from income/consumption to investment. Assuming that the test result describes the actual situation, the $\phi_{13,i}$ must be zero for $i = 1, 2, \dots$ (see also Chapter 2, Section 2.3.1).

The covariance matrix of

$$
\hat{\Psi}_1 = I_3 + \hat{\Phi}_1 = \begin{bmatrix}
.680 & .146 & .961 \\
.044 & .847 & .289 \\
-.002 & .225 & .736
\end{bmatrix}
$$

is, of course, the same as that of $\hat{\Phi}_1$ and an estimate of the covariance matrix of the elements of

$$
\hat{\Psi}_2 = I_3 + \hat{\Phi}_1 + \hat{\Phi}_2 = \begin{bmatrix}
.626 & .408 & 1.377 \\
.073 & .961 & .200 \\
.043 & .486 & .846
\end{bmatrix}
$$

is obtained as $(\hat{G}_1 + \hat{G}_2)\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}(\hat{G}_1 + \hat{G}_2)'/T$. Some accumulated impulse responses together with estimated standard errors are also given in Table 3.3 and accumulated responses of consumption to income impulses and of investment to consumption impulses are shown in Figures 3.6 and 3.7, respectively. They reinforce the findings for the individual impulse responses in Figures 3.4 and 3.5.


Fig. 3.6. Accumulated and long-run responses of consumption to a forecast error impulse in income with estimated asymptotic two-standard error bounds.


An estimate of the asymptotic covariance matrix of the estimated long-run responses $\hat{\Psi}_\infty = (I_3 - \hat{A}_1 - \hat{A}_2)^{-1}$ is, by (3.7.7),

$$
\frac{1}{T}\hat{F}_\infty\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}\hat{F}_\infty', \qquad \hat{F}_\infty = (\hat{\Psi}_\infty', \hat{\Psi}_\infty') \otimes \hat{\Psi}_\infty.
$$

The matrix $\hat{\Psi}_\infty$ together with the resulting standard errors is also given in Table 3.3. For instance, the total long-run effect $\hat{\psi}_{13,\infty}$ of a consumption impulse on investment is 1.295 and its estimated asymptotic standard error is .798. Not surprisingly, $\hat{\psi}_{13,\infty}$ is not significantly different from zero for any common level of significance (e.g., 10%). On the other hand, $\hat{\psi}_{32,\infty}$, the long-run effect on consumption due to an impulse in income, is significant at an asymptotic 5% level.

Fig. 3.7. Accumulated and long-run responses of investment to a forecast error impulse in consumption with estimated asymptotic two-standard error bounds.

For the interpretation of the $\Phi_i$’s, the critical remarks at the end of Chapter 2 must be kept in mind. As explained there, the $\Phi_i$ and $\Psi_n$ coefficients may not reflect the actual responses of the variables in the system. As an alternative, one may want to determine the responses to orthogonal residuals. In order to obtain the asymptotic covariance matrices of the $\hat{\Theta}_i$ and $\hat{\Xi}_n$, a decomposition of $\hat{\Sigma}_u$ is needed. For our example,

$$
\hat{P} = \begin{bmatrix}
4.61 & 0 & 0 \\
.16 & 1.16 & 0 \\
.27 & .49 & .76
\end{bmatrix} \times 10^{-2}
$$

is the lower triangular matrix with positive diagonal elements satisfying $\hat{P}\hat{P}' = \hat{\Sigma}_u$ (Choleski decomposition). The asymptotic covariance matrix of $\mathrm{vec}(\hat{P}) = \mathrm{vec}(\hat{\Theta}_0)$ is a $(9 \times 9)$ matrix which is estimated as

$$
\frac{1}{T}\hat{\bar{C}}_0\hat{\Sigma}_{\hat{\boldsymbol{\sigma}}}\hat{\bar{C}}_0' = \frac{2}{T}\hat{H}D_3^+(\hat{\Sigma}_u \otimes \hat{\Sigma}_u)D_3^{+\prime}\hat{H}',
$$

where, as usual, $D_3^+ = (D_3'D_3)^{-1}D_3'$ and

$$
\hat{H} = L_3'\left\{L_3\left[(I_3 \otimes \hat{P})K_{33} + (\hat{P} \otimes I_3)\right]L_3'\right\}^{-1}.
$$

The resulting estimated asymptotic standard errors of the elements of $\hat{P}$ are given in Table 3.4. Note that the variances corresponding to elements above the main diagonal of $\hat{P}$ are all zero because these elements are zero by definition and are not estimated.
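
The matrix $H = \partial\,\mathrm{vec}(P)/\partial\boldsymbol{\sigma}'$ from Proposition 3.6 can be checked against a numerical derivative of the Choleski factor. The sketch below is an illustration only: the helper names and the $(3 \times 3)$ covariance matrix are assumptions and are not the estimates of this example.

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order="F")                      # stack columns

def lower_pairs(m):
    return [(i, j) for j in range(m) for i in range(j, m)]   # vech ordering

def elimination(m):
    L = np.zeros((m * (m + 1) // 2, m * m))
    for r, (i, j) in enumerate(lower_pairs(m)):
        L[r, j * m + i] = 1.0
    return L

def commutation(m):
    K = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            K[i * m + j, j * m + i] = 1.0
    return K

def H_matrix(P):
    """H = L_K' { L_K [ (I_K kron P) K_KK + (P kron I_K) ] L_K' }^{-1}."""
    K = P.shape[0]
    L, Kkk, I = elimination(K), commutation(K), np.eye(K)
    inner = L @ (np.kron(I, P) @ Kkk + np.kron(P, I)) @ L.T
    return L.T @ np.linalg.inv(inner)

def numerical_H(Sigma, eps=1e-6):
    """Central finite differences of vec(chol(Sigma)) with respect to vech(Sigma)."""
    K = Sigma.shape[0]
    pairs = lower_pairs(K)
    out = np.zeros((K * K, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        dS = np.zeros((K, K))
        dS[i, j] = dS[j, i] = eps                        # keep Sigma symmetric
        diff = np.linalg.cholesky(Sigma + dS) - np.linalg.cholesky(Sigma - dS)
        out[:, c] = vec(diff) / (2.0 * eps)
    return out

# Hypothetical positive definite covariance matrix.
Sigma_u = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
P = np.linalg.cholesky(Sigma_u)
H = H_matrix(P)
max_abs_diff = np.max(np.abs(H - numerical_H(Sigma_u)))
```

Rows of `H` that correspond to elements above the main diagonal of $P$ are zero, in line with the zero variances noted above.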

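The asymptotic covariance matrix of the elements of

$$
\hat{\Theta}_1 = \begin{bmatrix}
-1.196 & .644 & .730 \\
.256 & -.035 & .219 \\
-.047 & .131 & -.201
\end{bmatrix} \times 10^{-2}
$$

is obtained as the sum of the two matrices

$$
\hat{C}_1\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}\hat{C}_1'/T = \left[(\hat{P}' \otimes I_3)\hat{G}_1\hat{\Sigma}_{\hat{\boldsymbol{\alpha}}}\hat{G}_1'(\hat{P} \otimes I_3)\right]/T
$$

and

$$
\hat{\bar{C}}_1\hat{\Sigma}_{\hat{\boldsymbol{\sigma}}}\hat{\bar{C}}_1'/T = (I_3 \otimes \hat{\Phi}_1)\hat{H}\hat{\Sigma}_{\hat{\boldsymbol{\sigma}}}\hat{H}'(I_3 \otimes \hat{\Phi}_1')/T.
$$

The resulting standard errors for the elements of $\hat{\Theta}_1$ are given in Table 3.4 along with some more $\hat{\Theta}_i$ and $\hat{\Xi}_n$ matrices.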
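
Some of the point estimates reported in this section can be cross-checked from the rounded matrices printed in the text. The following sketch does so for the accumulated responses $\hat{\Psi}_2 = I_3 + \hat{\Phi}_1 + \hat{\Phi}_2$ and for $\hat{\Theta}_1 = \hat{\Phi}_1\hat{P}$; agreement can only be expected up to rounding error because the inputs are given to three digits.

```python
import numpy as np

# Rounded estimates as printed in this section.
Phi1 = np.array([[-0.320, 0.146, 0.961],
                 [0.044, -0.153, 0.289],
                 [-0.002, 0.225, -0.264]])
Phi2 = np.array([[-0.054, 0.262, 0.416],
                 [0.029, 0.114, -0.088],
                 [0.045, 0.261, 0.110]])
P = np.array([[4.61, 0.0, 0.0],
              [0.16, 1.16, 0.0],
              [0.27, 0.49, 0.76]]) * 1e-2

Psi2 = np.eye(3) + Phi1 + Phi2        # accumulated responses after two periods
Theta1 = Phi1 @ P                     # orthogonalized responses at lag 1

Psi2_printed = np.array([[0.626, 0.408, 1.377],
                         [0.073, 0.961, 0.200],
                         [0.043, 0.486, 0.846]])
Theta1_printed = np.array([[-1.196, 0.644, 0.730],
                           [0.256, -0.035, 0.219],
                           [-0.047, 0.131, -0.201]]) * 1e-2

max_dev_Psi2 = np.max(np.abs(Psi2 - Psi2_printed))
max_dev_Theta1 = np.max(np.abs(Theta1 - Theta1_printed)) * 1e2   # in units of 10^-2
```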

Some  responses  and  accumulated  responses  of  consumption  to  income  in¬ 
novations  with  two-standard  error  bounds  are  depicted  in  Figures  3.8  and  3.9. 
The  responses  in  Figures  3.4  and  3.8  are  obviously  a  bit  different.  Note  the 
(significant)  immediate  reaction  of  consumption  in  Figure  3.8.  However,  from 
period  1  onwards  the  response  of  consumption  in  both  figures  is  qualitatively 
similar.  The  difference  of  scales  is  due  to  the  different  sizes  of  the  shocks 
traced  through  the  system.  For  instance,  Figure  3.4  is  based  on  a  unit  shock 
in  income  while  Figure  3.8  is  based  on  an  innovation  of  size  one  standard 
deviation  due  to  the  transformation  of  the  white  noise  residuals. 

Again, a test of overall significance of the impulse responses in Figure 3.8 could be performed using Bonferroni’s principle. Now we have to check the significance of the $\hat{\theta}_{32,i}$’s for $i = 0, 1, \dots, 4 = p(K-1)$. We reject the null hypothesis of no response if at least one of the coefficients is significantly different from zero. In this case, we can reject at an asymptotic 5% level of significance because $\hat{\theta}_{32,0}$ is significant at the 1% level (see Table 3.4). Thus, we may choose individual significance levels of 1% for each of the 5 coefficients and obtain 5% as an upper bound for the overall level. Of course, all these interpretations are based on the assumption that the actual asymptotic standard errors of the impulse responses are nonzero (see Section 3.7.1, Remark 1).

Table 3.4. Estimates of responses to orthogonal innovations for the investment/income/consumption system with estimated asymptotic standard errors in parentheses (all entries × 10^{-2})

  i     $\hat{\Theta}_i$                   $\hat{\Xi}_i$

  0     4.61     0       0            4.61     0       0
       (.38)                         (.38)
         .16    1.16     0             .16    1.16     0
       (.14)   (.10)                 (.14)   (.10)
         .27     .49     .76           .27     .49     .76
       (.11)   (.10)   (.06)         (.11)   (.10)   (.06)

  1    -1.20     .64     .73          3.46     .64     .73
       (.57)   (.56)   (.50)         (.63)   (.56)   (.50)
         .26    -.04     .22           .41    1.13     .22
       (.14)   (.14)   (.13)         (.20)   (.17)   (.13)
        -.05     .13    -.20           .22     .62     .56
       (.12)   (.12)   (.10)         (.15)   (.14)   (.11)

  2     -.10     .51     .32          3.32    1.15    1.05
       (.58)   (.57)   (.50)         (.74)   (.69)   (.58)
         .13     .09    -.07           .54    1.22     .15
       (.14)   (.14)   (.12)         (.24)   (.22)   (.17)
         .28     .36     .08           .50     .98     .64
       (.12)   (.12)   (.10)         (.20)   (.18)   (.14)

  ∞     0                             3.97    1.61     .98
                                     (.82)   (.92)   (.61)
                                       .61    1.42     .26
                                     (.31)   (.34)   (.22)
                                       .58    1.06     .73
                                     (.28)   (.32)   (.20)

We have also performed forecast error variance decompositions and we have computed the standard errors on the basis of the results given in Proposition 3.6. For some forecast horizons the decompositions are given in Table 3.5. The standard errors may be regarded as rough indications of the sampling uncertainty. It must be kept in mind, however, that they may be quite misleading if the true forecast error variance components are zero, as explained in Remark 6 of Section 3.7.1. Obviously, this qualification limits their value in the present example. Students are invited to reproduce the numbers in Table 3.5 and the previous tables of this section.

Fig. 3.8. Estimated responses of consumption to an orthogonalized impulse in income with estimated asymptotic two-standard error bounds.

3.7.4  Investigating  the  Distributions  of  the  Impulse  Responses  by 
Simulation  Techniques 

In  the  previous  subsections,  it  was  indicated  repeatedly  that  in  some  cases 
the  small  sample  validity  of  the  asymptotic  results  is  problematic.  In  that 
situation,  one  possibility  is  to  use  Monte  Carlo  or  bootstrapping  methods  for 
investigating  the  sampling  properties  of  the  quantities  of  interest.  Although 
these  methods  are  quite  expensive  in  terms  of  computer  time,  they  were  used 
in  the  past  for  evaluating  the  properties  of  impulse  response  functions  (see, 
e.g.,  Runkle  (1987)  and  Kilian  (1998,  1999)).  The  general  methodology  is 
described  in  Appendix  D. 

In the present situation, there are different approaches to simulation. One
possibility is to assume a specific distribution of the white noise process,
e.g., $u_t \sim \mathcal{N}(0, \hat\Sigma_u)$, and generate a large number of
time series realizations based on the estimated VAR coefficients. From these
time series, new sets of coefficients are then estimated and the corresponding
impulse responses and/or forecast error variance components are computed. The
empirical distributions obtained in this way may be used to investigate the
actual distributions of the quantities of interest.

Fig. 3.9. Estimated accumulated and long-run responses of consumption to an
orthogonalized impulse in income with estimated asymptotic two-standard error
bounds.

Alternatively, if an assumption regarding the white noise distribution cannot
be made, bootstrap methods may be used and new sets of residuals may
be drawn from the estimation residuals. A large number of $y_t$ time series is
generated on the basis of these sets of disturbances. The bootstrap multiple
time series obtained in this way are then used to compute estimates of the
quantities of interest and study their properties. Three different methods for
computing bootstrap confidence intervals in the present context are described
in Appendix D.3. We have used the standard and the Hall percentile methods
to compute confidence intervals for the response of consumption to a forecast
error impulse and an orthogonalized impulse in income for our example
system. The results are shown in Figures 3.10 and 3.11, respectively.

Some interesting observations can be made. First, for the forecast error
impulse responses, the two different methods for establishing confidence
intervals produce quite similar results which are also at least qualitatively
similar to the asymptotic confidence intervals in Figure 3.4. Second, the
situation is a bit different for the orthogonalized impulse responses in
Figure 3.11. Here the two different bootstrap methods produce rather different
confidence intervals. These intervals are quite asymmetric in the sense that
the estimated impulse responses are not in the middle between the lower and
upper bound of the intervals. Thereby they also look quite different from the
asymptotic intervals shown in Figure 3.8. The latter intervals are symmetric
around the estimated impulse response coefficients by construction. Again, the
qualitative interpretation does not change, however. In other words, the
instantaneous and the second coefficient are significantly different from
zero, as before. Moreover, the confidence intervals in Figure 3.11 are
consistent with a rapidly declining effect of an impulse in income.

Table 3.5. Forecast error variance decomposition of the
investment/income/consumption system with estimated asymptotic standard errors
in parentheses

                              proportions of forecast error variance, h periods
                              ahead, accounted for by innovations in
forecast        forecast      investment          income              consumption
error in        horizon h     $\hat\omega_{j1,h}$   $\hat\omega_{j2,h}$   $\hat\omega_{j3,h}$

investment         1          1.00(.00)           .00(.00)            .00(.00)
(j = 1)            2           .96(.04)           .02(.03)            .02(.03)
                   3           .95(.04)           .03(.03)            .03(.03)
                   4           .94(.05)           .03(.03)            .03(.03)
                   8           .94(.05)           .03(.03)            .03(.04)

income             1           .02(.03)           .98(.03)            .00(.00)
(j = 2)            2           .06(.05)           .91(.06)            .03(.04)
                   3           .07(.06)           .90(.07)            .03(.04)
                   4           .07(.06)           .89(.07)            .04(.04)
                   8           .07(.06)           .89(.07)            .04(.04)

consumption        1           .08(.06)           .27(.09)            .65(.09)
(j = 3)            2           .08(.06)           .27(.08)            .65(.09)
                   3           .13(.08)           .33(.09)            .54(.09)
                   4           .13(.08)           .34(.09)            .54(.09)
                   8           .13(.08)           .34(.09)            .53(.09)

Fig. 3.10. Estimated responses of consumption to a forecast error impulse in
income with 95% bootstrap confidence bounds based on 2000 bootstrap
replications (standard intervals and Hall's percentile intervals).

It must be emphasized, however, that the bootstrap generally does not
solve the problem of a singular asymptotic distribution of the impulse
responses and the resulting potentially invalid inference. If the asymptotic
distribution is singular, the bootstrap may fail to produce meaningful
confidence intervals, for example. Again it may be worth considering a
univariate AR(1) process $y_t = \alpha y_{t-1} + u_t$ for illustrative
purposes. The second forecast error impulse response coefficient is
$\phi_2 = \alpha^2$. The corresponding estimator $\hat\phi_2 = \hat\alpha^2$
was found to have a singular asymptotic distribution if $\alpha = 0$ (see
Remark 1 in Section 3.7.1). Suppose a bootstrap is used to produce N bootstrap
estimates $\hat\alpha^*_{(n)}$ of $\alpha$, n = 1, ..., N. Clearly, the
corresponding bootstrap estimates $\hat\phi^*_{2(n)} = \hat\alpha^{*2}_{(n)}$
will all be positive with probability one because they are squares. Thus, if
the standard $(1 - \gamma)100\%$ bootstrap confidence interval is constructed
in the usual way by choosing $\hat\phi^*_{2(N\gamma/2)}$ and
$\hat\phi^*_{2(N(1-\gamma/2))}$ as lower and upper bound, respectively, the
true value of zero will never be within the confidence interval. Hence, in
this case the actual confidence level will be zero. Although the Hall
confidence intervals may be a bit better in this case, they will also not
provide the desired coverage level even in large samples. A more detailed
discussion of this problem is given by Benkwitz et al. (2000), where methods
for correct asymptotic inference are also considered. One possible solution is
to eliminate all points where singularities of the asymptotic distribution may
occur by fitting subset models (see Chapter 5). Another possibility to
circumvent the problem is to allow the VAR process to be of infinite order and
increase the order with growing sample size. This possibility will be
discussed in detail in Chapter 15.
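The AR(1) argument above can be checked numerically. The following is a minimal sketch, not the procedure used for Figures 3.10 and 3.11: it assumes a residual-based bootstrap with centred residuals, a no-intercept LS estimator, and a simulated white noise series (so the true $\alpha$ is zero). The function names `ar1_ls` and `percentile_interval_phi2` and all numerical settings are hypothetical choices for illustration.

```python
import random


def ar1_ls(y):
    """LS estimate of alpha in y_t = alpha * y_{t-1} + u_t (no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def percentile_interval_phi2(y, n_boot=1000, gamma=0.05, seed=0):
    """Standard percentile bootstrap interval for phi_2 = alpha**2."""
    rng = random.Random(seed)
    alpha_hat = ar1_ls(y)
    resid = [y[t] - alpha_hat * y[t - 1] for t in range(1, len(y))]
    mean_resid = sum(resid) / len(resid)
    resid = [r - mean_resid for r in resid]  # centred residuals
    phi2_star = []
    for _ in range(n_boot):
        # Rebuild a bootstrap series from resampled residuals.
        y_star = [y[0]]
        for _ in range(len(resid)):
            y_star.append(alpha_hat * y_star[-1] + rng.choice(resid))
        phi2_star.append(ar1_ls(y_star) ** 2)  # squares, hence nonnegative
    phi2_star.sort()
    # Order statistics N*gamma/2 and N*(1 - gamma/2), converted to 0-based.
    lower = phi2_star[max(round(n_boot * gamma / 2) - 1, 0)]
    upper = phi2_star[round(n_boot * (1 - gamma / 2)) - 1]
    return lower, upper


# Simulated white noise, i.e. an AR(1) process with true alpha = 0.
noise_rng = random.Random(1)
y_demo = [noise_rng.gauss(0.0, 1.0) for _ in range(100)]
lower_demo, upper_demo = percentile_interval_phi2(y_demo)
```

Because every bootstrap estimate of $\phi_2$ is a square, the lower bound is strictly positive, so the interval excludes the true value zero no matter how many replications are drawn.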
Fig. 3.11. Estimated responses of consumption to an orthogonalized impulse
in income with 95% bootstrap confidence bounds based on 2000 bootstrap
replications (standard intervals and Hall's percentile intervals).


3.8  Exercises 


3.8.1  Algebraic  Problems 


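The notation of Sections 3.2-3.5 is used in the following problems.

Problem 3.1

Show that $\hat\beta = ((ZZ')^{-1}Z \otimes I_K)\mathbf{y}$ minimizes

$$S(\beta) = \mathbf{u}'\mathbf{u} = [\mathbf{y} - (Z' \otimes I_K)\beta]'[\mathbf{y} - (Z' \otimes I_K)\beta].$$

Problem 3.2

Prove that

$$\sqrt{T}(\hat{\mathbf{b}} - \mathbf{b}) \xrightarrow{d} \mathcal{N}(0, \Sigma_u \otimes \Gamma^{-1}),$$

if $y_t$ is stable and

$$\frac{1}{\sqrt{T}}\operatorname{vec}(ZU') = \frac{1}{\sqrt{T}}(I_K \otimes Z)\operatorname{vec}(U') \xrightarrow{d} \mathcal{N}(0, \Sigma_u \otimes \Gamma).$$

Problem 3.3

Show (3.4.17). (Hint: Use the product rule for matrix differentiation and
$\partial\operatorname{vec}(\Sigma_u^{-1})/\partial\operatorname{vec}(\Sigma_u)' = -\Sigma_u^{-1} \otimes \Sigma_u^{-1}$.)

Problem 3.4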

Derive  (3.4.18).  (Hint:  Use  the  last  expression  given  in  (3.4.6).) 

Problem  3.5 
Show  (3.4.19). 

Problem  3.6 
Derive  (3.4.20). 

Problem  3.7 

Prove that plim $z_T/\sqrt{T} = 0$, where

$$z_T = \sum_{i=1}^{p} A_i \left[\sum_{j=0}^{i-1} (y_{-j} - y_{T-j})\right].$$

(Hint: Show that $E(z_T/\sqrt{T}) \to 0$ and $\operatorname{Var}(z_T/\sqrt{T}) \to 0$.)


Problem  3.8 

Show  that  Equation  (3.5.10)  holds. 
(Hint: Define

$$Z_t(h) := \begin{bmatrix} 1 \\ y_t(h) \\ \vdots \\ y_t(h-p+1) \end{bmatrix}$$

and show $Z_t(h) = \mathbf{B} Z_t(h-1)$ by induction.)

Problem  3.9 

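In the context of Section 3.5, suppose that $y_t$ is a stable Gaussian VAR(p)
process which is estimated by ML in mean-adjusted form. Show that the
forecast MSE correction term has the form

$$\Omega(h) = E\left(\frac{\partial y_t(h)}{\partial\mu'}\,\Sigma_{\hat\mu}\,\frac{\partial y_t(h)'}{\partial\mu}\right) + E\left(\frac{\partial y_t(h)}{\partial\alpha'}\,\Sigma_{\hat\alpha}\,\frac{\partial y_t(h)'}{\partial\alpha}\right)$$

with

$$\frac{\partial y_t(h)}{\partial\mu'} = I_K - J\mathbf{A}^h \begin{bmatrix} I_K \\ \vdots \\ I_K \end{bmatrix}, \qquad \text{where the stacked matrix is } (Kp \times K),$$

and

$$\frac{\partial y_t(h)}{\partial\alpha'} = \sum_{i=0}^{h-1} (Y_t - \boldsymbol{\mu})'(\mathbf{A}')^{h-1-i} \otimes \Phi_i.$$

Here $\boldsymbol{\mu} := (\mu', \dots, \mu')'$ is a $(Kp \times 1)$ vector, $Y_t$ and $\mathbf{A}$ are as defined in (2.1.8),
$J := [I_K : 0 : \cdots : 0]$ is a $(K \times Kp)$ matrix, and $\Phi_i$ is the i-th coefficient
matrix of the prediction error MA representation (2.1.17).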

Problem  3.10 

Derive the ML estimator and its asymptotic distribution for the parameter of
a stable AR(1) process, $y_t = \alpha y_{t-1} + u_t$, $u_t \sim$ i.i.d. $\mathcal{N}(0, \sigma_u^2)$.

3.8.2  Numerical  Problems 

The  following  problems  require  the  use  of  a  computer.  They  are  based  on 
the  two  quarterly,  seasonally  adjusted  U.S.  investment  series  given  in  File  E2. 
Consider  the  variables 

$y_1$ - first differences of fixed investment,

$y_2$ - first differences of change in business inventories,

in the following problems. Use the data from 1947 to 1968 only.

Problem 3.11

Plot the two time series $y_{1t}$ and $y_{2t}$ and comment on the stationarity and
stability of the series.

Problem 3.12

Estimate the parameters of a VAR(1) model for $(y_{1t}, y_{2t})'$ using multivariate
LS, that is, compute $\hat B$ and $\hat\Sigma_u$. Comment on the stability of the estimated
process. 

Problem  3.13 

Use the mean-adjusted form of a VAR(1) model and estimate the coefficients.
Assume that the data generation process is Gaussian and estimate the
covariance matrix of the asymptotic distribution of the ML estimators.

Problem 3.14

Determine the Yule-Walker estimate of the VAR(1) coefficient matrix and
compare it to the LS estimate.

Problem 3.15

Use the LS estimate and compute point forecasts $\hat y_{86}(1)$, $\hat y_{86}(2)$ (that is, the
forecast origin is the last quarter of 1968) and the corresponding MSE matrices
$\hat\Sigma_y(1)$, $\hat\Sigma_y(2)$, $\hat\Sigma_{\hat y}(1)$, and $\hat\Sigma_{\hat y}(2)$. Use these estimates to set up approximate
95% interval forecasts assuming that the process $y_t$ is Gaussian.

Problem  3.16 

Test the hypothesis that $y_2$ does not Granger-cause $y_1$.

Problem 3.17

Estimate the coefficient matrices $\Phi_1$ and $\Phi_2$ from the LS estimates of the
VAR(1) model for $y_t$ and determine approximate standard errors of the
estimates.

Problem 3.18

Determine the upper triangular matrix $\hat P$ with positive diagonal for which
$\hat P\hat P' = \hat\Sigma_u$. Estimate the covariance matrix of the asymptotic distribution of
$\hat P$ under the assumption that $y_t$ is Gaussian. Test the hypothesis that the
upper right-hand corner element of the underlying matrix P is zero.

Problem 3.19

Use the results of the previous problems to compute $\hat\Theta_0$, $\hat\Theta_1$, and
$\hat\Theta_2$. Determine also estimates of the asymptotic standard errors of the
elements of these three matrices.


4 


VAR  Order  Selection  and  Checking  the  Model 
Adequacy 


4.1  Introduction 

In the previous chapter, we have assumed that we are given a K-dimensional
multiple time series $y_1, \dots, y_T$, with $y_t = (y_{1t}, \dots, y_{Kt})'$, which is known to
be generated by a VAR(p) process,

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \tag{4.1.1}$$

and we have discussed estimation of the parameters $\nu$, $A_1, \dots, A_p$, and
$\Sigma_u = E(u_t u_t')$. In deriving the properties of the estimators, a number of assumptions
were  made.  In  practice,  it  will  rarely  be  known  with  certainty  whether  the 
conditions  hold  that  are  required  to  derive  the  consistency  and  asymptotic 
normality  of  the  estimators.  Therefore  statistical  tools  should  be  used  in  order 
to  check  the  validity  of  the  assumptions  made.  In  this  chapter,  some  such  tools 
will  be  discussed. 

In  the  next  two  sections,  it  will  be  discussed  what  to  do  if  the  VAR  order 
p  is  unknown.  In  practice,  the  order  will  usually  be  unknown.  In  Chapter  3, 
we  have  assumed  that  a  VAR(p)  process  such  as  (4.1.1)  represents  the  data 
generation process. We have not assumed that all the $A_i$ are nonzero. In
particular $A_p$ may be zero. In other words, p is just assumed to be an upper
bound  for  the  VAR  order.  On  the  other  hand,  from  (3.5.13)  we  know  that  the 
approximate  MSE  matrix  of  the  1-step  ahead  predictor  will  increase  with  the 
order  p.  Thus,  choosing  p  unnecessarily  large  will  reduce  the  forecast  precision 
of  the  corresponding  estimated  VAR(p)  model.  Also,  the  estimation  precision 
of  the  impulse  responses  depends  on  the  precision  of  the  parameter  estimates. 
Therefore  it  is  useful  to  have  procedures  or  criteria  for  choosing  an  adequate 
VAR  order. 

In Sections 4.4-4.6, possibilities are discussed for checking some of the
assumptions of the previous chapters. The asymptotic distribution of the
residual autocorrelations and so-called portmanteau tests are considered in
Section 4.4. The latter tests are popular tools for checking the whiteness of
the residuals. More precisely, they are used to test for nonzero residual
autocorrelation. In Section 4.5, tests for nonnormality are considered. The
normality assumption was used in Chapter 3 in setting up forecast intervals.

One assumption underlying much of the previous analysis is the stationarity
of the systems considered. Nonstationarities may have various forms.
Not only trends indicate deviations from stationarity but also changes in the
variability or variance of the system. Moreover, exogenous shocks may affect
various characteristics of the system. Tests for structural change are presented
in Section 4.6.


4.2  A  Sequence  of  Tests  for  Determining  the  VAR  Order 

Obviously, there is not just one correct VAR order for the process (4.1.1). In
fact, if (4.1.1) is a correct summary of the characteristics of the process $y_t$,
then the same is true for

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + A_{p+1} y_{t-p-1} + u_t$$

with $A_{p+1} = 0$. In other words, if $y_t$ is a VAR(p) process, in this sense it is
also a VAR(p + 1) process. In the assumptions of the previous chapter, the
possibility of zero coefficient matrices is not excluded. In this chapter, it is
practical to have a unique number that is called the order of the process.
Therefore, in the following we will call $y_t$ a VAR(p) process if $A_p \neq 0$ and
$A_i = 0$ for i > p so that p is the smallest possible order. This unique number
will be called the VAR order.


4.2.1  The  Impact  of  the  Fitted  VAR  Order  on  the  Forecast  MSE 


If $y_t$ is a VAR(p) process, it is useful to fit a VAR(p) model to the available
multiple time series and not, for instance, a VAR(p + i) because, under a
mean square error measure, forecasts from the latter process will be inferior
to those based on an estimated VAR(p) model. This result follows from the
approximate forecast MSE matrix $\Sigma_{\hat y}(h)$ derived in Section 3.5.2 of Chapter
3. For instance, for h = 1,

$$\Sigma_{\hat y}(1) = \frac{T + Kp + 1}{T}\,\Sigma_u,$$

if a VAR(p) model is fitted to data generated by a K-dimensional VAR process
with order not greater than p. Obviously, $\Sigma_{\hat y}(1)$ is an increasing function of
the order of the model fitted to the data.
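To get a feeling for the size of this effect, the scalar factor $(T + Kp + 1)/T$ multiplying $\Sigma_u$ can be tabulated for a few fitted orders. The sketch below is purely illustrative; the function name `mse_inflation` and the chosen values of T, K, and p are assumptions, not part of the text.

```python
def mse_inflation(T, K, p):
    """Factor (T + K*p + 1) / T multiplying Sigma_u in the approximate
    1-step forecast MSE matrix of a fitted K-dimensional VAR(p)."""
    return (T + K * p + 1) / T


# Bivariate system (K = 2) with fitted orders 2, 4 and 6.
factors = {
    T: [mse_inflation(T, K=2, p=p) for p in (2, 4, 6)]
    for T in (30, 50, 100)
}
for T, values in factors.items():
    print(T, [round(v, 3) for v in values])
```

For fixed T the factor grows linearly in p, and for fixed p it shrinks towards one as T grows, which matches the qualitative statement in the text.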

Because  the  approximate  MSE  matrix  is  derived  from  asymptotic  theory, 
it  is  of  interest  to  know  whether  the  result  remains  true  in  small  samples. 
To  get  some  feeling  for  the  answer  to  this  question,  we  have  generated  1000 
Gaussian  bivariate  time  series  with  a  process  similar  to  (3.2.25), 
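$$y_t = \begin{bmatrix} .02 \\ .03 \end{bmatrix} + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} y_{t-1} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} y_{t-2} + u_t, \qquad \Sigma_u = \begin{bmatrix} .09 & 0 \\ 0 & .04 \end{bmatrix}. \tag{4.2.1}$$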


We  have  fitted  VAR(2),  VAR(4),  and  VAR(6)  models  to  the  generated  series 
and  we  have  computed  forecasts  with  the  estimated  models.  Then  we  have 
compared these forecasts to generated post-sample values. The resulting
average squared forecasting errors for different forecast horizons h and sample
sizes  T  are  shown  in  Table  4.1.  Obviously,  the  forecasts  based  on  estimated 
VAR(2)  models  are  clearly  superior  to  the  VAR(4)  and  VAR(6)  forecasts  for 
sample  sizes  T  =  30,  50,  and  100.  While  the  comparative  advantage  of  the 
VAR(2)  models  is  quite  dramatic  for  T  =  30,  it  diminishes  with  increasing 
sample  size.  This,  of  course,  was  to  be  expected  given  that  the  approximate 
forecast  MSE  matrix  of  an  estimated  process  approaches  that  of  the  known 
process  as  the  sample  size  increases  (see  Section  3.5). 
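A small-scale version of this experiment can be sketched as follows. This is an illustration under stated assumptions, not the code behind Table 4.1: it uses NumPy, the coefficient values as read from (4.2.1), only 1-step forecasts, a burn-in of 100 periods started at zero, and 300 rather than 1000 replications. All function names are hypothetical.

```python
import numpy as np

# Coefficients as read from (4.2.1); treat them as assumptions of this sketch.
NU = np.array([0.02, 0.03])
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
SIGMA_U = np.array([[0.09, 0.0], [0.0, 0.04]])


def simulate(n_obs, rng, burn_in=100):
    """Generate n_obs observations of the Gaussian VAR(2) after a burn-in."""
    chol = np.linalg.cholesky(SIGMA_U)
    y = np.zeros((n_obs + burn_in + 2, 2))
    for t in range(2, len(y)):
        u = chol @ rng.standard_normal(2)
        y[t] = NU + A1 @ y[t - 1] + A2 @ y[t - 2] + u
    return y[-n_obs:]


def one_step_forecast(y, p):
    """Fit a VAR(p) with intercept by multivariate LS and forecast one step."""
    n = len(y)
    # Each row holds (1, y_{t-1}', ..., y_{t-p}') for t = p, ..., n-1.
    X = np.array([
        np.concatenate([[1.0]] + [y[t - j] for j in range(1, p + 1)])
        for t in range(p, n)
    ])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    x_next = np.concatenate([[1.0]] + [y[n - j] for j in range(1, p + 1)])
    return x_next @ coef


def average_squared_errors(T, orders=(2, 4, 6), reps=300, seed=0):
    """Average squared 1-step forecast errors per variable and fitted order."""
    rng = np.random.default_rng(seed)
    max_p = max(orders)
    sums = {p: np.zeros(2) for p in orders}
    for _ in range(reps):
        y = simulate(T + max_p + 1, rng)
        sample, target = y[:-1], y[-1]
        for p in orders:
            # T effective observations plus p presample values.
            forecast = one_step_forecast(sample[max_p - p:], p)
            sums[p] += (target - forecast) ** 2
    return {p: sums[p] / reps for p in orders}


errors_T30 = average_squared_errors(30)
for p, err in errors_T30.items():
    print(f"VAR({p}):", np.round(err, 3))
```

The same series is used for all fitted orders within a replication, so the comparison across orders is a paired one and needs fewer replications than independent draws would.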


Table 4.1. Average squared forecast errors for the estimated bivariate VAR(2)
process (4.2.1) based on 1000 realizations

sample   forecast             average squared forecast errors
size     horizon       VAR(2)           VAR(4)           VAR(6)
  T         h        y_1    y_2       y_1    y_2       y_1    y_2

            1       .111   .052      .132   .062      .165   .075
 30         2       .155   .084      .182   .098      .223   .119
            3       .146   .141      .183   .166      .225   .202

            1       .108   .043      .119   .048      .129   .054
 50         2       .132   .075      .144   .083      .161   .093
            3       .142   .120      .150   .130      .168   .145

            1       .091   .044      .095   .046      .098   .049
100         2       .120   .064      .125   .067      .130   .069
            3       .130   .108      .135   .113      .140   .113

Of course, the process considered in this example is a very special one. To
see whether a similar result is obtained for other processes as well, we have
also generated 1000 three-dimensional time series with the VAR(1) process
(2.1.14),

$$y_t = \begin{bmatrix} .01 \\ .02 \\ 0 \end{bmatrix} + \begin{bmatrix} .5 & 0 & 0 \\ .1 & .1 & .3 \\ 0 & .2 & .3 \end{bmatrix} y_{t-1} + u_t \quad \text{with} \quad \Sigma_u = \begin{bmatrix} 2.25 & 0 & 0 \\ 0 & 1.0 & .5 \\ 0 & .5 & .74 \end{bmatrix}. \tag{4.2.2}$$

We have fitted VAR(1), VAR(3), and VAR(6) models to these data and we
have computed forecasts and forecast errors. Some average squared forecast
errors are presented in Table 4.2. Again forecasts from lower order models are
clearly superior to higher order models. In fact, in a large scale simulation
study involving many more processes, similar results were found (see
Lütkepohl (1985)). Thus, it is useful to avoid fitting VAR models with
unnecessarily large orders.


Table 4.2. Average squared forecast errors for the estimated three-dimensional
VAR(1) process (4.2.2) based on 1000 realizations

sample  forecast                      average squared forecast errors
size    horizon         VAR(1)                 VAR(3)                 VAR(6)
  T        h       y_1    y_2    y_3      y_1    y_2    y_3      y_1    y_2    y_3

           1       .87   1.14   2.68     1.14   1.52   3.62     2.25   2.78   6.82
 30        2      1.09   1.21   3.21     1.44   1.67   4.12     2.54   2.98   7.85
           3      1.06   1.31   3.32     1.35   1.58   4.23     2.59   2.79   8.63

           1       .81   1.03   2.68      .96   1.22   2.97     1.18   1.53   3.88
 50        2      1.01   1.23   2.92     1.20   1.40   3.47     1.48   1.68   4.38
           3      1.01   1.29   3.11     1.12   1.44   3.48     1.42   1.77   4.66

           1       .73    .93   2.35      .77   1.00   2.62      .86   1.12   2.91
100        2       .94   1.15   2.86     1.00   1.24   3.12     1.12   1.38   3.53
           3       .90   1.15   3.02      .93   1.20   3.23     1.03   1.35   3.51

The question is then what to do if the true order is unknown and only an
upper bound, say M, for the order is known. One possibility to check whether
certain coefficient matrices may be zero is to set up a significance test. For
our particular problem of determining the correct VAR order, we may set
up a sequence of tests. First $H_0: A_M = 0$ is tested. If this null hypothesis
cannot be rejected, we test $H_0: A_{M-1} = 0$ and so on until we can reject a
null hypothesis. Before we discuss this procedure in more detail, we will now
introduce a possible test statistic.

4.2.2  The  Likelihood  Ratio  Test  Statistic 

Because  we  just  need  to  test  zero  restrictions  on  the  coefficients  of  a  VAR 
model,  we  may  use  the  Wald  statistic  discussed  in  Section  3.6  in  the  context 
of  causality  tests.  To  shed  some  more  light  on  this  type  of  statistic,  it  may 
be  instructive  to  consider  the  likelihood  ratio  testing  principle.  It  is  based 
on  comparing  the  maxima  of  the  log-likelihood  function  over  the  unrestricted 
and  restricted  parameter  space.  Specifically,  the  likelihood  ratio  statistic  is 


$$\lambda_{LR} = 2[\ln l(\tilde\delta) - \ln l(\tilde\delta_r)], \tag{4.2.3}$$

where $\tilde\delta$ is the unrestricted ML estimator for a parameter vector $\delta$ obtained
by maximizing the likelihood function over the full feasible parameter space
and $\tilde\delta_r$ is the restricted ML estimator which is obtained by maximizing the
likelihood function over that part of the parameter space where the restrictions
of interest are satisfied (see Appendix C.7). For the case of interest here, where
we have linear constraints for the coefficients of a VAR process, $\lambda_{LR}$ can be
shown to have an asymptotic $\chi^2$-distribution with as many degrees of freedom
as there are distinct linear restrictions.

To obtain this result, let us assume for the moment that $y_t$ is a stable
Gaussian (normally distributed) VAR(p) process as in (4.1.1). Using the
notation of Section 3.2.1 (as opposed to the mean-adjusted form considered in
Section 3.4), the log-likelihood function is

$$\ln l(\beta, \Sigma_u) = -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}[\mathbf{y} - (Z' \otimes I_K)\beta]'(I_T \otimes \Sigma_u^{-1})[\mathbf{y} - (Z' \otimes I_K)\beta] \tag{4.2.4}$$

(see (3.4.5)). The first order partial derivatives with respect to $\beta$ are

$$\frac{\partial \ln l}{\partial\beta} = (Z \otimes \Sigma_u^{-1})\mathbf{y} - (ZZ' \otimes \Sigma_u^{-1})\beta. \tag{4.2.5}$$

Equating to zero and solving for $\beta$ gives the unrestricted ML/LS estimator

$$\tilde\beta = ((ZZ')^{-1}Z \otimes I_K)\mathbf{y}. \tag{4.2.6}$$

Suppose the restrictions for $\beta$ are given in the form

$$C\beta = c, \tag{4.2.7}$$

where C is a known $(N \times (K^2p + K))$ matrix of rank N and c is a known $(N \times 1)$
vector. Then the restricted ML estimator may be found by a Lagrangian
approach (see Appendix A.14). The Lagrange function is

$$\mathcal{L}(\beta, \gamma) = \ln l(\beta) + \gamma'(C\beta - c), \tag{4.2.8}$$

where $\gamma$ is an $(N \times 1)$ vector of Lagrange multipliers. Of course, $\mathcal{L}$ also depends
on $\Sigma_u$. Because these parameters are not involved in the restrictions (4.2.7),
we have skipped them there. The restricted maximum of the log-likelihood
function with respect to $\beta$ is known to be attained at a point where the first
order partial derivatives of $\mathcal{L}$ are zero.

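$$\frac{\partial\mathcal{L}}{\partial\beta} = (Z \otimes \Sigma_u^{-1})\mathbf{y} - (ZZ' \otimes \Sigma_u^{-1})\beta + C'\gamma, \tag{4.2.9a}$$

$$\frac{\partial\mathcal{L}}{\partial\gamma} = C\beta - c. \tag{4.2.9b}$$

Equating to zero and solving gives

$$\tilde\beta_r = \tilde\beta + [(ZZ')^{-1} \otimes \Sigma_u]\,C'\,[C((ZZ')^{-1} \otimes \Sigma_u)C']^{-1}(c - C\tilde\beta) \tag{4.2.10}$$

(see Problem 4.1).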

Because for any given coefficient matrix $B^0$ the maximum of $\ln l$ with
respect to $\Sigma_u$ is obtained for

$$\Sigma_u^0 = \frac{1}{T}(Y - B^0Z)(Y - B^0Z)'$$

(see Section 3.4.2, (3.4.8) and (3.4.11)), the maximum for the unrestricted
case is attained for

$$\tilde\Sigma_u = \frac{1}{T}(Y - \tilde BZ)(Y - \tilde BZ)' \tag{4.2.11}$$

and for the restricted case we get

$$\tilde\Sigma_u^r = \frac{1}{T}(Y - \tilde B_rZ)(Y - \tilde B_rZ)'. \tag{4.2.12}$$

Here $\tilde B$ and $\tilde B_r$ are the coefficient matrices corresponding to $\tilde\beta$ and $\tilde\beta_r$,
respectively, that is, $\tilde\beta = \operatorname{vec}(\tilde B)$ and $\tilde\beta_r = \operatorname{vec}(\tilde B_r)$. Thus, for this particular
situation, the likelihood ratio statistic becomes

$$\lambda_{LR} = 2[\ln l(\tilde\beta, \tilde\Sigma_u) - \ln l(\tilde\beta_r, \tilde\Sigma_u^r)].$$

This statistic can be shown to have an asymptotic $\chi^2(N)$-distribution. In
fact, this result also holds if $y_t$ is not Gaussian, but has a distribution from
a larger family. If $y_t$ is not Gaussian, the estimators obtained by maximizing
the Gaussian likelihood function in (4.2.4) are called quasi ML estimators. We
will now state the previous results formally and then present a proof.

Proposition 4.1 (Asymptotic Distribution of the LR Statistic)

Let $y_t$ be a stationary, stable VAR(p) process as in (4.1.1) with standard white
noise $u_t$ (see Definition 3.1). Suppose the true parameter vector $\beta$ satisfies
linear constraints $C\beta = c$, where C is an $(N \times (K^2p + K))$ matrix of rank
N and c is an $(N \times 1)$ vector. Moreover, let $\ln l$ denote the Gaussian
log-likelihood function and let $\tilde\beta$ and $\tilde\beta_r$ be the (quasi) ML and restricted (quasi)
ML estimators, respectively, with corresponding estimators $\tilde\Sigma_u$ and $\tilde\Sigma_u^r$ of the
white noise covariance matrix $\Sigma_u$ given in (4.2.11) and (4.2.12). Then

$$\lambda_{LR} = 2[\ln l(\tilde\beta, \tilde\Sigma_u) - \ln l(\tilde\beta_r, \tilde\Sigma_u^r)]$$

$$= T(\ln|\tilde\Sigma_u^r| - \ln|\tilde\Sigma_u|) \tag{4.2.13a}$$

$$= (\tilde\beta_r - \tilde\beta)'(ZZ' \otimes \tilde\Sigma_u^{-1})(\tilde\beta_r - \tilde\beta) \tag{4.2.13b}$$

$$= (\tilde\beta_r - \tilde\beta)'(ZZ' \otimes (\tilde\Sigma_u^r)^{-1})(\tilde\beta_r - \tilde\beta) + o_p(1) \tag{4.2.13c}$$

$$= (C\tilde\beta - c)'[C((ZZ')^{-1} \otimes \tilde\Sigma_u)C']^{-1}(C\tilde\beta - c) + o_p(1) \tag{4.2.13d}$$

$$= (C\tilde\beta - c)'[C((ZZ')^{-1} \otimes \tilde\Sigma_u^r)C']^{-1}(C\tilde\beta - c) + o_p(1) \tag{4.2.13e}$$

and

$$\lambda_{LR} \xrightarrow{d} \chi^2(N).$$

Here T is the sample size (time series length) and $Z := (Z_0, \dots, Z_{T-1})$ with
$Z_t' := (1, y_t', \dots, y_{t-p+1}')$. ■
In this proposition, the quantity $o_p(1)$ denotes a sequence which converges
to zero in probability when the sample size $T \to \infty$ (see Appendix C.2). Note
that $y_t$ is not assumed to be Gaussian (normally distributed) in the
proposition. It is just assumed that $u_t$ is independent white noise with bounded
fourth moments. Thus, $\ln l$ may not really be the log-likelihood function of
$\mathbf{y} := \operatorname{vec}(y_1, \dots, y_T)$. It will only be the actual log-likelihood if $\mathbf{y}$ happens to be
multivariate normal. In that case, $\tilde\beta$ and $\tilde\beta_r$ are actual ML and restricted ML
estimators. Otherwise they are quasi ML estimators.

The second form of the LR statistic in (4.2.13a) is sometimes convenient for
computing the actual test value. It is also useful for comparing the likelihood
ratio tests to other procedures for VAR order selection, as we will see in Section
4.3. The expression in (4.2.13b) shows the similarity of the LR statistic to the
LM statistic given in (4.2.13c). Using (4.2.5) and

$$\frac{\partial^2 \ln l}{\partial\beta\,\partial\beta'} = -(ZZ' \otimes \Sigma_u^{-1})$$

gives

$$\lambda_{LM} = \frac{\partial \ln l(\tilde\beta_r)}{\partial\beta'}\left(-\frac{\partial^2 \ln l(\tilde\beta_r)}{\partial\beta\,\partial\beta'}\right)^{-1}\frac{\partial \ln l(\tilde\beta_r)}{\partial\beta} = (\tilde\beta_r - \tilde\beta)'(ZZ' \otimes (\tilde\Sigma_u^r)^{-1})(\tilde\beta_r - \tilde\beta)$$

(see Appendix C.7 and Problem 4.5). Notice that in the present case we may
ignore the part of the parameter vector which corresponds to $\Sigma_u$ because its
ML estimator is asymptotically independent of the other parameters and the
asymptotic covariance matrix is block-diagonal. Therefore, at least
asymptotically, the terms related to scores of the covariance parameters vanish
from the LM statistic.

Comparing (4.2.13d) to (3.6.5) shows that, for the special case considered
here, the LR statistic is also similar to the Wald statistic. In fact, the
important difference between the Wald and LR statistics is that the former
involves only estimators of the unrestricted model while both unrestricted and
restricted estimators enter into $\lambda_{LR}$ (see also (4.2.13a)). The final form of the
statistic given in (4.2.13e) provides another useful expression which is close to
both the LR and the LM statistic. It shows that we may use the covariance
matrix estimator from the restricted model instead of the unrestricted one.

As in the case of the Wald test, one may consider using the statistic $\lambda_{LR}/N$
in conjunction with the $F(N, T - Kp - 1)$-distribution in small samples.
Another adjustment was suggested by Hannan (1970, p. 341).
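The quantities in Proposition 4.1 are straightforward to compute. The following NumPy sketch is an illustration under stated assumptions rather than a reference implementation: it tests $H_0: A_p = 0$ in a VAR(p), evaluates (4.2.10) with the unrestricted covariance estimate plugged in for $\Sigma_u$, and uses simulated data with hypothetical coefficients. The function names are invented for this example.

```python
import numpy as np


def ls_var(y, p):
    """Multivariate LS for a VAR(p) with intercept.

    Returns B (K x (Kp+1)), Z ((Kp+1) x T) and the ML covariance estimate.
    """
    n, K = y.shape
    T = n - p
    Z = np.vstack([np.ones(T)] + [y[p - j:n - j].T for j in range(1, p + 1)])
    Y = y[p:].T
    B = Y @ Z.T @ np.linalg.inv(Z @ Z.T)
    U = Y - B @ Z
    return B, Z, U @ U.T / T


def restricted_estimator(beta, Z, sigma_u, C, c):
    """Restricted estimator as in (4.2.10) for a given covariance matrix."""
    V = np.kron(np.linalg.inv(Z @ Z.T), sigma_u)  # (ZZ')^{-1} (x) Sigma_u
    gain = V @ C.T @ np.linalg.inv(C @ V @ C.T)
    return beta + gain @ (c - C @ beta)


def lr_and_wald(y, p):
    """Test H0: A_p = 0; returns (4.2.13a), the form (4.2.13d), and beta_r."""
    n, K = y.shape
    B, Z, sigma = ls_var(y, p)
    T = Z.shape[1]
    beta = B.flatten(order="F")  # beta = vec(B)
    N = K * K
    # C picks the last K^2 elements of beta, i.e. vec(A_p).
    C = np.hstack([np.zeros((N, beta.size - N)), np.eye(N)])
    c = np.zeros(N)
    beta_r = restricted_estimator(beta, Z, sigma, C, c)
    B_r = beta_r.reshape(B.shape, order="F")
    U_r = y[p:].T - B_r @ Z
    sigma_r = U_r @ U_r.T / T
    lr = T * (np.log(np.linalg.det(sigma_r)) - np.log(np.linalg.det(sigma)))
    d = C @ beta - c
    V = np.kron(np.linalg.inv(Z @ Z.T), sigma)
    wald = d @ np.linalg.solve(C @ V @ C.T, d)
    return lr, wald, beta_r


# Hypothetical bivariate VAR(1) data, so that H0: A_2 = 0 is true.
rng = np.random.default_rng(1)
A_demo = np.array([[0.5, 0.1], [0.4, 0.5]])
y_demo = np.zeros((150, 2))
for t in range(1, 150):
    y_demo[t] = A_demo @ y_demo[t - 1] + 0.3 * rng.standard_normal(2)
lr_demo, wald_demo, beta_r_demo = lr_and_wald(y_demo, p=2)
```

Because the restriction removes the same regressors from every equation, the restricted estimator here coincides with LS for a VAR(1) fitted to the same effective sample, whatever covariance matrix is plugged into (4.2.10). For restrictions that differ across equations this shortcut is not available and the plug-in choice matters.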
Proof of Proposition 4.1:

We first show the equivalence of the various forms of the LR statistic given in
the proposition. The equality in (4.2.13a) follows by noting that

$$[\mathbf{y} - (Z' \otimes I_K)\beta]'(I_T \otimes \Sigma_u^{-1})[\mathbf{y} - (Z' \otimes I_K)\beta]$$

$$= \operatorname{tr}[(Y - BZ)'\Sigma_u^{-1}(Y - BZ)]$$

$$= \operatorname{tr}[\Sigma_u^{-1}(Y - BZ)(Y - BZ)'].$$

Replacing the matrices B and $\Sigma_u$ by $\tilde B$ and $\tilde\Sigma_u$, respectively, gives

$$\ln l(\tilde\beta, \tilde\Sigma_u) = \text{constant} - \frac{T}{2}\ln|\tilde\Sigma_u|.$$

Similarly,

$$\ln l(\tilde\beta_r, \tilde\Sigma_u^r) = \text{constant} - \frac{T}{2}\ln|\tilde\Sigma_u^r|,$$

which gives the desired result.

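In order to prove (4.2.13b), we observe that $\ln l$ is a quadratic function
in $\beta$. Thus, by Taylor's theorem (Appendix A.13, Proposition A.3), for an
arbitrary fixed vector $\beta^0$,

$$\ln l(\beta) = \ln l(\beta^0) + \frac{\partial \ln l(\beta^0)}{\partial\beta'}(\beta - \beta^0) + \frac{1}{2}(\beta - \beta^0)'\frac{\partial^2 \ln l(\beta^0)}{\partial\beta\,\partial\beta'}(\beta - \beta^0).$$

Choosing $\tilde\beta$ for $\beta^0$ and $\tilde\beta_r$ for $\beta$, $\partial \ln l(\tilde\beta)/\partial\beta' = 0$ so that

$$\lambda_{LR} = 2[\ln l(\tilde\beta) - \ln l(\tilde\beta_r)] = -(\tilde\beta_r - \tilde\beta)'\frac{\partial^2 \ln l(\tilde\beta)}{\partial\beta\,\partial\beta'}(\tilde\beta_r - \tilde\beta). \tag{4.2.14}$$

As in Section 3.4, we can derive

$$\frac{\partial^2 \ln l}{\partial\beta\,\partial\beta'} = -(ZZ' \otimes \Sigma_u^{-1}).$$

Hence, (4.2.13b) follows and (4.2.13c) is an immediate consequence of the
fact that the restricted and unrestricted covariance matrix estimators are
consistent by Proposition 3.2. Thus, plim$(\tilde\Sigma_u - \tilde\Sigma_u^r) = 0$ which can be used to
show (4.2.13c).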

Using (4.2.10) and (4.2.14) gives

$$\lambda_{LR} = (C\tilde\beta - c)'[C((ZZ')^{-1} \otimes \Sigma_u)C']^{-1}$$

$$\times\; C((ZZ')^{-1} \otimes \Sigma_u)(ZZ' \otimes \Sigma_u^{-1})((ZZ')^{-1} \otimes \Sigma_u)C'$$

$$\times\; [C((ZZ')^{-1} \otimes \Sigma_u)C']^{-1}(C\tilde\beta - c).$$

The result (4.2.13d) follows by replacing $\Sigma_u$ with $\tilde\Sigma_u$ and noting that this is a
consistent estimator of $\Sigma_u$. Again by consistency of $\tilde\Sigma_u^r$, using this estimator
instead of $\tilde\Sigma_u$ changes the statistic only by a term which vanishes in probability
as the sample size increases. Hence, we have (4.2.13e).

The asymptotic $\chi^2(N)$-distribution of $\lambda_{LR}$ now follows from Proposition
C.15(5) of Appendix C because $[C((ZZ'/T)^{-1} \otimes \tilde\Sigma_u)C']^{-1}$ is a consistent
estimator of $[C(\Gamma^{-1} \otimes \Sigma_u)C']^{-1}$. ■

In  the  next  subsection  a  sequential  testing  scheme  based  on  LR  tests  is 
discussed. 

4.2.3  A  Testing  Scheme  for  VAR  Order  Determination 

Assuming  that  M  is  known  to  be  an  upper  bound  for  the  VAR  order,  the 
following  sequence  of  null  and  alternative  hypotheses  may  be  tested  using  LR 
tests: 


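  $H_0^1: A_M = 0$          versus   $H_1^1: A_M \neq 0$
  $H_0^2: A_{M-1} = 0$      versus   $H_1^2: A_{M-1} \neq 0 \mid A_M = 0$
     ...
  $H_0^i: A_{M-i+1} = 0$    versus   $H_1^i: A_{M-i+1} \neq 0 \mid A_M = \cdots = A_{M-i+2} = 0$
     ...
  $H_0^M: A_1 = 0$          versus   $H_1^M: A_1 \neq 0 \mid A_M = \cdots = A_2 = 0$.   (4.2.15)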

In this scheme, each null hypothesis is tested conditionally on the previous ones being true. The procedure terminates and the VAR order is chosen accordingly, if one of the null hypotheses is rejected. That is, if $H_0^i$ is rejected, $\hat p = M - i + 1$ will be chosen as estimate of the autoregressive order.

The  likelihood  ratio  statistic  for  testing  the  i-th  null  hypothesis  is 

  $\lambda_{LR}(i) = T[\ln|\tilde\Sigma_u(M - i)| - \ln|\tilde\Sigma_u(M - i + 1)|]$,   (4.2.16)

where $\tilde\Sigma_u(m)$ denotes the ML estimator of $\Sigma_u$ when a VAR($m$) model is fitted to a time series of length $T$. By Proposition 4.1, this statistic has an asymptotic $\chi^2(K^2)$-distribution if $H_0^i$ and all previous null hypotheses are true. Note that $K^2$ parameters are set to zero in $H_0^i$. Hence, we have to test $K^2$ restrictions and we use $\lambda_{LR}(i)$ in conjunction with critical values from a $\chi^2(K^2)$-distribution. Alternatively, one may use $\lambda_{LR}(i)/K^2$ in conjunction with the $F(K^2, T - K(M - i + 1) - 1)$-distribution.

Of course, the order chosen for a particular process will depend on the significance levels used in the tests. In this procedure, it is important to realize that the significance levels of the individual tests must be distinguished from the Type I error of the whole procedure because rejection of $H_0^i$ means that $H_0^{i+1}, \dots, H_0^M$ are automatically rejected too. Thus, denoting by $D_j$ the event that $H_0^j$ is rejected in the $j$-th test when it is actually true, the probability of a Type I error for the $i$-th test in the sequence is

  $\epsilon_i = \Pr(D_1 \cup D_2 \cup \cdots \cup D_i)$.

Because $D_j$ is the event that $\lambda_{LR}(j)$ falls in the rejection region, although $H_0^j$ is true, $\gamma_j = \Pr(D_j)$ is just the significance level of the $j$-th individual test. It can be shown that for $m \neq j$ and $m, j \leq i$, $\lambda_{LR}(m)$ and $\lambda_{LR}(j)$ are asymptotically independent statistics if $H_0^1, \dots, H_0^i$ are true (see Paulsen & Tjøstheim (1985, pp. 223-224)). Hence, $D_m$ and $D_j$ are independent events in large samples so that

  $\epsilon_i = \Pr(D_1 \cup \cdots \cup D_{i-1}) + \Pr(D_i) - \Pr\{(D_1 \cup \cdots \cup D_{i-1}) \cap D_i\}$

  $\quad\; = \epsilon_{i-1} + \gamma_i - \epsilon_{i-1}\gamma_i = \epsilon_{i-1} + \gamma_i(1 - \epsilon_{i-1})$,   $i = 2, 3, \dots, M$.   (4.2.17)

Of course, $\epsilon_1 = \gamma_1$. Thus, it is easily seen by induction that

  $\epsilon_i = 1 - (1 - \gamma_1) \cdots (1 - \gamma_i)$,   $i = 1, 2, \dots, M$.   (4.2.18)

If, for example, a 5% significance level is chosen for each individual test ($\gamma_i = .05$), then

  $\epsilon_1 = .05$,   $\epsilon_2 = 1 - .95 \times .95 = .0975$,   $\epsilon_3 = .142625$.

Hence,  the  actual  rejection  probability  will  become  quite  substantial  if  the 
sequence  of  null  hypotheses  to  be  tested  is  long. 
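The recursion (4.2.17) is easy to evaluate numerically. The following is a minimal sketch, assuming asymptotic independence of the individual tests as stated above; the function name `overall_type_one_error` is introduced here for illustration only.

```python
def overall_type_one_error(gammas):
    """Return [eps_1, ..., eps_M] from individual levels gamma_1, ..., gamma_M.

    Uses the recursion eps_i = eps_{i-1} + gamma_i * (1 - eps_{i-1}) of (4.2.17),
    which presumes that the individual tests are (asymptotically) independent.
    """
    eps = []
    previous = 0.0  # starting from 0 makes eps_1 equal to gamma_1
    for gamma in gammas:
        previous = previous + gamma * (1.0 - previous)
        eps.append(previous)
    return eps


# Four tests (M = 4), each at an individual 5% level.
levels = overall_type_one_error([0.05] * 4)
for i, eps_i in enumerate(levels, start=1):
    print(i, round(eps_i, 6))
```

The first three values agree with the numbers quoted in the text, and each value coincides with the closed form (4.2.18).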

It  is  difficult  to  decide  on  appropriate  significance  levels  in  the  testing 
scheme  (4.2.15).  Whatever  significance  levels  the  researcher  decides  to  use, 
she  or  he  should  keep  in  mind  the  distinction  between  the  overall  and  the 
individual  significance  levels.  Also,  it  must  be  kept  in  mind  that  we  know  the 
asymptotic  distributions  of  the  LR  statistics  only.  Thus,  the  significance  levels 
chosen  will  be  approximate  probabilities  of  Type  I  errors  only. 

Finally, in the literature another testing scheme was also suggested and used. In that scheme the first set of hypotheses ($i = 1$) is as in (4.2.15) and for $i > 1$ the following hypotheses are tested:

  $H_0^i: A_M = \cdots = A_{M-i+1} = 0$   versus   $H_1^i: A_M \neq 0$ or ... or $A_{M-i+1} \neq 0$.

Here $H_0^i$ is not tested conditionally on the previous null hypotheses being true but it is tested against the full VAR($M$) model. Unfortunately, the LR statistics to be used in such a sequence will not be independent so that the overall significance level (probability of Type I error) is difficult to determine.

4.2.4 An Example

To illustrate the sequential testing procedure described in the foregoing, we use the investment/income/consumption example from Section 3.2.3. The variables $y_1$, $y_2$, and $y_3$ represent first differences of the logarithms of the investment, income, and consumption data. We assume an upper bound of $M = 4$ for the VAR order and therefore we set aside the first 4 values as presample values. The data up to 1978.4 are used for estimation so that the sample size is $T = 71$ in each test. The estimated error covariance matrices and their determinants are given in Table 4.3. The corresponding $\chi^2$- and $F$-test values are summarized in Table 4.4. Because the denominator degrees of freedom for the $F$-statistics are quite large (ranging from 58 to 67), the $F$-tests are qualitatively similar to the $\chi^2$-tests. Using individual significance levels of .05 in each test, $H_0^3: A_2 = 0$ is the first null hypothesis that is rejected. Thus, the estimated order from both tests is $\hat p = 2$. This supports the order chosen in the example in Chapter 3. Alternative procedures for choosing VAR orders are considered in the next section.


Table 4.3. ML estimates of the error covariance matrix of the investment/income/consumption system

VAR order m    $\tilde\Sigma_u(m) \times 10^4$ (upper triangle)    $|\tilde\Sigma_u(m)| \times 10^{11}$

     0         21.83     .410    1.228
                 .      1.420     .571                  2.473
                 .        .      1.084

     1         20.14     .493    1.173
                 .      1.318     .625                  1.782
                 .        .      1.018

     2         19.18     .617    1.126
                 .      1.270     .574                  1.255
                 .        .       .821

     3         19.08     .599    1.126
                 .      1.235     .543                  1.174
                 .        .       .784

     4         16.96     .573    1.252
                 .      1.234     .544                   .958
                 .        .       .765


Table 4.4. LR statistics for the investment/income/consumption system

  i    $H_0^i$        VAR order under $H_0^i$    $\lambda_{LR}$ (a)    $\lambda_{LR}/9$ (b)
  1    $A_4 = 0$               3                     14.44                 1.60
  2    $A_3 = 0$               2                      4.76                  .53
  3    $A_2 = 0$               1                     24.90                 2.77
  4    $A_1 = 0$               0                     23.25                 2.58

(a) Critical value for individual 5% level test: $\chi^2(9)_{.95} = 16.92$.
(b) Critical value for individual 5% level test: $F(9, 71 - 3(5 - i) - 1)_{.95} \approx 2$.
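As a rough check of the example, the LR statistics (4.2.16) can be recomputed from the determinants reported in Table 4.3. The sketch below is illustrative only: the dictionary `det_sigma` and the helper names are introduced here, and because the tabulated determinants are rounded to four digits, the recomputed statistics match Table 4.4 only up to small rounding differences.

```python
import math

T, K = 71, 3
# |Sigma_u(m)| for m = 0, ..., 4, copied from Table 4.3 (rounded values).
det_sigma = {0: 2.473e-11, 1: 1.782e-11, 2: 1.255e-11, 3: 1.174e-11, 4: 0.958e-11}
M = max(det_sigma)
CHI2_9_95 = 16.92  # 5% critical value of chi^2(9), as quoted under Table 4.4


def lr_statistic(i):
    """lambda_LR(i) of (4.2.16): VAR(M - i) is tested against VAR(M - i + 1)."""
    return T * (math.log(det_sigma[M - i]) - math.log(det_sigma[M - i + 1]))


def sequential_order():
    """Run the scheme (4.2.15): stop at the first rejection, return M - i + 1."""
    for i in range(1, M + 1):
        if lr_statistic(i) > CHI2_9_95:
            return M - i + 1
    return 0  # no null hypothesis rejected


lr_values = [lr_statistic(i) for i in range(1, M + 1)]
p_hat = sequential_order()
for i, value in enumerate(lr_values, start=1):
    print(i, round(value, 2), round(value / K**2, 2))
print("estimated order:", p_hat)
```

With the chi-square critical value, the first rejection occurs at the third test, so the sketch selects order 2, in line with the discussion of the example.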


4.3  Criteria  for  VAR  Order  Selection 

Although  performing  statistical  tests  is  a  common  strategy  for  detecting 
nonzero parameters, the approach described in the previous section is not completely
satisfactory if a model is desired for a specific purpose. For instance,
a  VAR  model  is  often  constructed  for  prediction  of  the  variables  involved. 
In  such  a  case,  we  are  not  so  much  interested  in  finding  the  correct  order  of 
the  underlying  data  generation  process  but  we  are  interested  in  obtaining  a 
good  model  for  prediction.  Hence,  it  seems  useful  to  take  the  objective  of  the 
analysis  into  account  when  choosing  the  VAR  order.  In  the  next  subsection, 
we  will  discuss  criteria  based  on  the  forecasting  objective. 

If  we  really  want  to  know  the  exact  order  of  the  data  generation  process 
(e.g.,  for  analysis  purposes)  it  is  still  questionable  whether  a  testing  procedure 
is  the  optimal  strategy  because  that  strategy  has  a  positive  probability  of 
choosing  an  incorrect  order  even  if  the  sample  size  (time  series  length)  is  large 
(see  Section  4.3.3).  In  Section  4.3.2  we  will  present  estimation  procedures  that 
choose  the  correct  order  with  probability  1  at  least  in  large  samples. 


4.3.1  Minimizing  the  Forecast  MSE 


If forecasting is the objective, it makes sense to choose the order such that a measure of forecast precision is minimized. The forecast MSE (mean squared error) is such a measure. Therefore Akaike (1969, 1971) suggested basing the VAR order choice on the approximate 1-step ahead forecast MSE given in Chapter 3, (3.5.13),

  $\Sigma_{\hat y}(1) = \frac{T + Km + 1}{T} \Sigma_u$,


where $m$ denotes the order of the VAR process fitted to the data, $T$ is the sample size, and $K$ is the dimension of the time series. To make this criterion operational, the white noise covariance matrix $\Sigma_u$ has to be replaced by an estimate. Also, to obtain a unique solution we would like to have a scalar criterion rather than a matrix. Akaike suggested using the LS estimator with degrees of freedom adjustment,

  $\hat\Sigma_u(m) = \frac{T}{T - Km - 1} \tilde\Sigma_u(m)$,

for $\Sigma_u$ and taking the determinant of the resulting expression. Here $\tilde\Sigma_u(m)$ is the ML estimator of $\Sigma_u$ obtained by fitting a VAR($m$) model, as in the previous section. The resulting criterion is called the final prediction error (FPE) criterion, that is,

  $\mathrm{FPE}(m) = \det\left[ \frac{T + Km + 1}{T} \cdot \frac{T}{T - Km - 1} \tilde\Sigma_u(m) \right]$

  $\qquad\quad\;\; = \left[ \frac{T + Km + 1}{T - Km - 1} \right]^K \det \tilde\Sigma_u(m)$.   (4.3.1)


We have written the criterion in terms of the ML estimator of the covariance matrix because in this form the FPE criterion has intuitive appeal. If the order $m$ is increased, $\det \tilde\Sigma_u(m)$ declines while the multiplicative term $(T + Km + 1)/(T - Km - 1)$ increases. The VAR order estimate is obtained as that value for which the two forces are balanced optimally. Note that the determinant of the LS estimate $\hat\Sigma_u(m)$ may increase with increasing $m$. On the other hand, it is quite obvious that $|\tilde\Sigma_u(m)|$ cannot become larger when $m$ increases because the maximum of the log-likelihood function is proportional to $-\ln|\tilde\Sigma_u(m)|$ apart from an additive constant and, for $m < n$, a VAR($m$) model may be interpreted as a restricted VAR($n$) model. Thus, $-\ln|\tilde\Sigma_u(m)| \leq -\ln|\tilde\Sigma_u(n)|$ or $|\tilde\Sigma_u(m)| \geq |\tilde\Sigma_u(n)|$.

Based on the FPE criterion, the estimate $\hat p(\mathrm{FPE})$ of $p$ is chosen such that

  $\mathrm{FPE}[\hat p(\mathrm{FPE})] = \min\{\mathrm{FPE}(m) \mid m = 0, 1, \dots, M\}$.

That is, VAR models of orders $m = 0, 1, \dots, M$ are estimated and the corresponding FPE($m$) values are computed. The order minimizing the FPE values is then chosen as estimate for $p$.

Akaike (1973, 1974), based on a quite different reasoning, derived a very similar criterion usually abbreviated by AIC (Akaike's Information Criterion). For a VAR($m$) process the criterion is defined as

  $\mathrm{AIC}(m) = \ln|\tilde\Sigma_u(m)| + \frac{2}{T}$ (number of freely estimated parameters)

  $\qquad\quad\;\; = \ln|\tilde\Sigma_u(m)| + \frac{2mK^2}{T}$.   (4.3.2)

The estimate $\hat p(\mathrm{AIC})$ for $p$ is chosen so that this criterion is minimized. Here the constants in the VAR model may be ignored as freely estimated parameters because counting them would just add a constant to the criterion which does not change the minimizing order.

The similarity of the criteria AIC and FPE can be seen by noting that, for a constant $N$,

  $\frac{T + N}{T - N} = 1 + \frac{2N}{T} + O(T^{-2})$.

The quantity $O(T^{-2})$ denotes a sequence of order $T^{-2}$, that is, a sequence indexed by $T$ that remains bounded if multiplied by $T^2$ (see Appendix C.2). Thus, the sequence goes to zero rapidly when $T \to \infty$. Hence,

  $\ln \mathrm{FPE}(m) = \ln|\tilde\Sigma_u(m)| + K \ln[(T + Km + 1)/(T - Km - 1)]$

  $\qquad\qquad\;\; = \ln|\tilde\Sigma_u(m)| + K \ln[1 + 2(Km + 1)/T + O(T^{-2})]$

  $\qquad\qquad\;\; = \ln|\tilde\Sigma_u(m)| + K \frac{2(Km + 1)}{T} + O(T^{-2})$

  $\qquad\qquad\;\; = \mathrm{AIC}(m) + 2K/T + O(T^{-2})$.   (4.3.3)


The third equality sign follows from a Taylor series expansion of $\ln(1 + x)$ around $x = 0$. The term $2K/T$ does not depend on the order $m$ and, hence, $\mathrm{AIC}(m) + 2K/T$ and $\mathrm{AIC}(m)$ assume their minimum for the same value of $m$. Consequently, AIC and ln FPE differ essentially by a term of order $O(T^{-2})$ and, thus, the two criteria will be about equivalent for moderate and large $T$.
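The order of the approximation error in (4.3.3) can be inspected numerically. In the sketch below the determinant term cancels, so only the penalty parts of ln FPE and AIC are compared; the function name `gap` and the chosen values of $K$, $m$, and $T$ are illustrative assumptions.

```python
import math


def gap(T, K, m):
    """ln FPE(m) - AIC(m) - 2K/T; the term ln|Sigma_u(m)| cancels out."""
    ln_fpe_penalty = K * math.log((T + K * m + 1) / (T - K * m - 1))
    aic_penalty = 2.0 * m * K**2 / T
    return ln_fpe_penalty - aic_penalty - 2.0 * K / T


# If the remainder is O(T^-2), then gap * T^2 must stay bounded as T grows.
for T in (50, 100, 200, 400):
    remainder = gap(T, K=3, m=2)
    print(T, remainder, remainder * T**2)
```

For these settings the scaled remainder stays bounded (and in fact shrinks) as $T$ increases, which is consistent with the $O(T^{-2})$ statement.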

To illustrate these procedures for VAR order selection, we use again the investment/income/consumption example. The determinants of the residual covariance matrices are given in Table 4.3. Using these determinants, the FPE and AIC values presented in Table 4.5 are obtained. Both criteria reach their minimum for $m = 2$, that is, $\hat p(\mathrm{FPE}) = \hat p(\mathrm{AIC}) = 2$. The other quantities given in the table will be discussed shortly.
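The FPE and AIC values of the example can be recomputed from the determinants in Table 4.3 using (4.3.1) and (4.3.2). The following is a small sketch under that assumption; `det_sigma`, `fpe`, and `aic` are names chosen here for illustration, and the inputs are the rounded determinants from the table.

```python
import math

T, K = 71, 3
# |Sigma_u(m)| for m = 0, ..., 4, copied from Table 4.3 (rounded values).
det_sigma = {0: 2.473e-11, 1: 1.782e-11, 2: 1.255e-11, 3: 1.174e-11, 4: 0.958e-11}


def fpe(m):
    """Final prediction error criterion (4.3.1)."""
    return ((T + K * m + 1) / (T - K * m - 1)) ** K * det_sigma[m]


def aic(m):
    """Akaike's information criterion (4.3.2)."""
    return math.log(det_sigma[m]) + 2.0 * m * K**2 / T


p_fpe = min(det_sigma, key=fpe)  # order with the smallest FPE value
p_aic = min(det_sigma, key=aic)  # order with the smallest AIC value

for m in sorted(det_sigma):
    print(m, round(fpe(m) * 1e11, 3), round(aic(m), 2))
print("p(FPE) =", p_fpe, " p(AIC) =", p_aic)
```

Both criteria are minimized at order 2 for these inputs, as reported for the example.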


Table 4.5. Estimation of the VAR order of the investment/income/consumption system

VAR order m    FPE(m) x 10^11    AIC(m)      HQ(m)       SC(m)
     0             2.691         -24.42      -24.42*     -24.42*
     1             2.500         -24.50      -24.38      -24.21
     2             2.272*        -24.59*     -24.37      -24.02
     3             2.748         -24.41      -24.07      -23.55
     4             2.910         -24.36      -23.90      -23.21

* Minimum.


4.3.2  Consistent  Order  Selection 

If interest centers on the correct VAR order, it makes sense to choose an estimator that has desirable sampling properties. One problem of interest in this context is to determine the statistical properties of order estimators such as $\hat p(\mathrm{FPE})$ and $\hat p(\mathrm{AIC})$. Consistency is a desirable asymptotic property of an estimator. As usual, an estimator $\hat p$ of the VAR order $p$ is called consistent if

  $\operatorname{plim}_{T \to \infty} \hat p = p$   or, equivalently,   $\lim_{T \to \infty} \Pr\{\hat p = p\} = 1$.   (4.3.4)

The latter definition of the plim may seem to differ slightly from the one given in Appendix C. However, it is easily checked that the two definitions are equivalent for integer valued random variables. Of course, a reasonable estimator for $p$ should be integer valued. The estimator $\hat p$ is called strongly consistent if

  $\Pr\{\lim \hat p = p\} = 1$.   (4.3.5)

Accordingly, a VAR order selection criterion will be called consistent or strongly consistent if the resulting estimator has these properties. The following proposition due to Hannan & Quinn (1979), Quinn (1980), and Paulsen (1984) is useful for investigating the consistency of criteria for order determination.

Proposition 4.2 (Consistency of VAR Order Estimators)

Let $y_t$ be a $K$-dimensional stationary, stable VAR($p$) process with standard white noise (that is, $u_t$ is independent white noise with bounded fourth moments). Suppose the maximum order $M \geq p$ and $\hat p$ is chosen so as to minimize a criterion

  $Cr(m) = \ln|\tilde\Sigma_u(m)| + m c_T/T$   (4.3.6)

over $m = 0, 1, \dots, M$. Here $\tilde\Sigma_u(m)$ denotes the (quasi) ML estimator of $\Sigma_u$ obtained for a VAR($m$) model and $c_T$ is a nondecreasing sequence of real numbers that depends on the sample size $T$. Then $\hat p$ is consistent if and only if

  $c_T \to \infty$ and $c_T/T \to 0$ as $T \to \infty$.   (4.3.7a)

The estimator $\hat p$ is a strongly consistent estimator if and only if (4.3.7a) holds and

  $c_T/(2 \ln \ln T) > 1$   (4.3.7b)

eventually, as $T \to \infty$. ■

We will not prove this proposition here but refer the reader to Quinn (1980) and Paulsen (1984) for proofs. The basic idea of the proof is to show that, for $p > m$, the quantity $\ln|\tilde\Sigma_u(m)| / \ln|\tilde\Sigma_u(p)|$ will be greater than one in large samples because $\ln|\tilde\Sigma_u(m)|$ is essentially the minimum of minus the Gaussian log-likelihood function for a VAR($m$) model. Consequently, because the penalty terms $m c_T/T$ and $p c_T/T$ go to zero as $T \to \infty$, $Cr(m) > Cr(p)$ for large $T$. Thus, the probability of choosing too small an order goes to zero as $T \to \infty$. Similarly, if $m > p$, $\ln|\tilde\Sigma_u(m)| / \ln|\tilde\Sigma_u(p)|$ approaches 1 in probability if $T \to \infty$ and the penalty term of the lower order model is smaller than that of a larger order process. Thus the lower order $p$ will be chosen if the sample size is large. The following corollary is an easy consequence of the proposition.

Corollary 4.2.1

Under the conditions of Proposition 4.2, if $M > p$, $\hat p(\mathrm{FPE})$ and $\hat p(\mathrm{AIC})$ are not consistent. ■

Proof: Because FPE and AIC are asymptotically equivalent (see (4.3.3)), it suffices to prove the corollary for $\hat p(\mathrm{AIC})$. Equating AIC($m$) and $Cr(m)$ given in (4.3.6) shows that

  $2mK^2/T = m c_T/T$

or $c_T = 2K^2$. Obviously, this sequence does not satisfy (4.3.7a). ■


We will see shortly that the limiting probability for underestimating the VAR order is zero for both $\hat p(\mathrm{AIC})$ and $\hat p(\mathrm{FPE})$ so that asymptotically they overestimate the true order with positive probability. However, Paulsen & Tjøstheim (1985, p. 224) argued that the limiting probability for overestimating the order declines with increasing dimension $K$ and is negligible for $K \geq 5$. In other words, asymptotically AIC and FPE choose the correct order almost with probability one if the underlying multiple time series has large dimension $K$.

Before we continue the investigation of AIC and FPE, we shall introduce two consistent criteria that have been quite popular in recent applied work. The first one is due to Hannan & Quinn (1979) and Quinn (1980). It is often denoted by HQ (Hannan-Quinn criterion):

  $\mathrm{HQ}(m) = \ln|\tilde\Sigma_u(m)| + \frac{2 \ln \ln T}{T}$ (# freely estimated parameters)

  $\qquad\quad\; = \ln|\tilde\Sigma_u(m)| + \frac{2 \ln \ln T}{T} mK^2$.   (4.3.8)


The estimate $\hat p(\mathrm{HQ})$ is the order that minimizes HQ($m$) for $m = 0, 1, \dots, M$. Comparing this criterion to (4.3.6) shows that $c_T = 2K^2 \ln \ln T$ and, thus, by (4.3.7a), HQ is consistent for univariate processes and by (4.3.7b) it is strongly consistent for $K > 1$, if the conditions of Proposition 4.2 are satisfied for $y_t$. Using Bayesian arguments Schwarz (1978) derived the following criterion:

  $\mathrm{SC}(m) = \ln|\tilde\Sigma_u(m)| + \frac{\ln T}{T}$ (# freely estimated parameters)

  $\qquad\quad\; = \ln|\tilde\Sigma_u(m)| + \frac{\ln T}{T} mK^2$.   (4.3.9)

Again the order estimate $\hat p(\mathrm{SC})$ is chosen so as to minimize the value of the criterion. A comparison with (4.3.6) shows that for this criterion $c_T = K^2 \ln T$. Because

  $K^2 \ln T / (2 \ln \ln T)$

approaches infinity for $T \to \infty$, (4.3.7b) is satisfied and SC is seen to be strongly consistent for any dimension $K$.

Corollary 4.2.2

Under the conditions of Proposition 4.2, SC is strongly consistent and HQ is consistent. If the dimension $K$ of the process is greater than one, both criteria are strongly consistent. ■

In Table 4.5, the values of HQ and SC for the investment/income/consumption example are given. Both criteria assume the minimum for $m = 0$, that is, $\hat p(\mathrm{HQ}) = \hat p(\mathrm{SC}) = 0$.
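The HQ and SC columns of Table 4.5 can be reproduced in the same way from (4.3.8) and (4.3.9). This is again only a sketch based on the rounded determinants of Table 4.3; the names `det_sigma`, `hq`, and `sc` are illustrative.

```python
import math

T, K = 71, 3
# |Sigma_u(m)| for m = 0, ..., 4, copied from Table 4.3 (rounded values).
det_sigma = {0: 2.473e-11, 1: 1.782e-11, 2: 1.255e-11, 3: 1.174e-11, 4: 0.958e-11}


def hq(m):
    """Hannan-Quinn criterion (4.3.8)."""
    return math.log(det_sigma[m]) + 2.0 * math.log(math.log(T)) / T * m * K**2


def sc(m):
    """Schwarz criterion (4.3.9)."""
    return math.log(det_sigma[m]) + math.log(T) / T * m * K**2


p_hq = min(det_sigma, key=hq)
p_sc = min(det_sigma, key=sc)

for m in sorted(det_sigma):
    print(m, round(hq(m), 2), round(sc(m), 2))
print("p(HQ) =", p_hq, " p(SC) =", p_sc)
```

Because the penalty per additional lag is heavier than for AIC, both criteria pick order 0 for this data set, whereas FPE and AIC pick order 2.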

4.3.3  Comparison  of  Order  Selection  Criteria 

It  is  worth  emphasizing  that  the  foregoing  results  do  not  necessarily  mean  that 
AIC  and  FPE  are  inferior  to  HQ  and  SC.  Only  if  consistency  is  the  yardstick 
for  evaluating  the  criteria,  the  latter  two  are  superior  under  the  conditions  of 
the  previous  section.  So  far  we  have  not  considered  the  small  sample  properties 
of  the  estimators.  In  small  samples,  AIC  and  FPE  may  have  better  properties 
(choose  the  correct  order  more  often)  than  HQ  and  SC.  Also,  the  former 
two  criteria  are  designed  for  minimizing  the  forecast  error  variance.  Thus,  in 
small  as  well  as  large  samples,  models  based  on  AIC  and  FPE  may  produce 
superior  forecasts  although  they  may  not  estimate  the  orders  correctly.  In  fact, 
Shibata  (1980)  derived  asymptotic  optimality  properties  of  AIC  and  FPE  for 
univariate  processes.  He  showed  that,  under  suitable  conditions,  they  indeed 
minimize  the  1-step  ahead  forecast  MSE  asymptotically. 

Although  it  is  difficult  in  general  to  derive  small  sample  properties  of 
the  criteria,  some  such  properties  can  be  obtained.  The  following  proposition 
states  small  sample  relations  between  the  criteria. 

Proposition 4.3 (Small Sample Comparison of AIC, HQ, and SC)

Let $y_{-M+1}, \dots, y_0, y_1, \dots, y_T$ be any $K$-dimensional multiple time series and suppose that VAR($m$) models, $m = 0, 1, \dots, M$, are fitted to $y_1, \dots, y_T$. Then the following relations hold:

  $\hat p(\mathrm{SC}) \leq \hat p(\mathrm{AIC})$   if $T \geq 8$,   (4.3.10)

  $\hat p(\mathrm{SC}) \leq \hat p(\mathrm{HQ})$   for all $T$,   (4.3.11)

  $\hat p(\mathrm{HQ}) \leq \hat p(\mathrm{AIC})$   if $T \geq 16$.   (4.3.12)

■

Note that we do not require stationarity of $y_t$. In fact, we do not even require that the multiple time series is generated by a VAR process. Moreover, the proposition is valid in small samples and not just asymptotically. The proof is an easy consequence of the following lemma.

Lemma 4.1

Let $a_0, a_1, \dots, a_M$, $b_0, b_1, \dots, b_M$ and $c_0, c_1, \dots, c_M$ be real numbers. If

  $b_{m+1} - b_m \leq a_{m+1} - a_m$,   $m = 0, 1, \dots, M - 1$,   (4.3.13a)

holds and if nonnegative integers $n$ and $k$ are chosen such that

  $c_n + a_n = \min\{c_m + a_m \mid m = 0, 1, \dots, M\}$   (4.3.13b)

and

  $c_k + b_k = \min\{c_m + b_m \mid m = 0, 1, \dots, M\}$,   (4.3.13c)

then $k \geq n$.¹ ■

The  proof  of  this  lemma  is  left  as  an  exercise  (see  Problem  4.2).  It  is  now 
easy  to  prove  Proposition  4.3. 


Proof of Proposition 4.3:

Let $c_m = \ln|\tilde\Sigma_u(m)|$, $b_m = 2mK^2/T$ and $a_m = mK^2 \ln T/T$. Then $\mathrm{AIC}(m) = c_m + b_m$ and $\mathrm{SC}(m) = c_m + a_m$. The sequences $a_m$, $b_m$, and $c_m$ satisfy the conditions of the lemma if

  $2K^2/T = 2(m + 1)K^2/T - 2mK^2/T = b_{m+1} - b_m$

  $\qquad \leq a_{m+1} - a_m = (m + 1)K^2 \ln T/T - mK^2 \ln T/T = K^2 \ln T/T$

or, equivalently, if $\ln T \geq 2$ or $T \geq e^2 \approx 7.39$. Hence, choosing $k = \hat p(\mathrm{AIC})$ and $n = \hat p(\mathrm{SC})$ gives $\hat p(\mathrm{SC}) \leq \hat p(\mathrm{AIC})$ if $T \geq 8$. The relations (4.3.11) and (4.3.12) can be shown analogously. ■
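The ordering in Proposition 4.3 does not depend on where the values $c_m$ come from, so it can be spot-checked with arbitrary numbers. The sketch below draws random stand-ins for $\ln|\tilde\Sigma_u(m)|$ and compares the minimizing orders; the helper names, the random ranges, and the seed are assumptions made for this illustration, and such a check is no substitute for the proof.

```python
import math
import random


def selected_order(c, penalties):
    """Index m minimizing c_m + penalty_m (first minimizer if there are ties)."""
    values = [c_m + pen_m for c_m, pen_m in zip(c, penalties)]
    return values.index(min(values))


def compare_once(rng, T, K, M):
    # c_m plays the role of ln|Sigma_u(m)|; Lemma 4.1 allows any real numbers.
    c = [rng.uniform(-30.0, -20.0) for _ in range(M + 1)]
    aic_pen = [2.0 * m * K**2 / T for m in range(M + 1)]
    hq_pen = [2.0 * math.log(math.log(T)) * m * K**2 / T for m in range(M + 1)]
    sc_pen = [math.log(T) * m * K**2 / T for m in range(M + 1)]
    return (selected_order(c, sc_pen),
            selected_order(c, hq_pen),
            selected_order(c, aic_pen))


rng = random.Random(12345)
violations = 0
for _ in range(5000):
    T = rng.randint(16, 500)  # T >= 16, so all three relations should apply
    K = rng.randint(1, 4)
    p_sc, p_hq, p_aic = compare_once(rng, T, K, M=6)
    if not (p_sc <= p_hq <= p_aic):
        violations += 1
print("violations:", violations)
```

The thresholds in the proposition come from comparing penalty increments: $\ln T \geq 2$ gives $T \geq 8$ for SC versus AIC, and $\ln \ln T \geq 1$ gives $T \geq 16$ for HQ versus AIC.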

An immediate consequence of Corollary 4.2.1 and Proposition 4.3 is that AIC and FPE asymptotically overestimate the true order with positive probability and underestimate the true order with probability zero.

Corollary 4.3.1

Under the conditions of Proposition 4.2, if $M > p$,

  $\lim_{T \to \infty} \Pr\{\hat p(\mathrm{AIC}) < p\} = 0$   and   $\lim_{T \to \infty} \Pr\{\hat p(\mathrm{AIC}) > p\} > 0$   (4.3.14)

and the same holds for $\hat p(\mathrm{FPE})$. ■

Proof: By (4.3.10) and Corollary 4.2.2,

  $\Pr\{\hat p(\mathrm{AIC}) < p\} \leq \Pr\{\hat p(\mathrm{SC}) < p\} \to 0$.

Because AIC is not consistent by Corollary 4.2.1, $\lim \Pr\{\hat p(\mathrm{AIC}) = p\} < 1$. Hence (4.3.14) follows. The same holds for FPE because this criterion is asymptotically equivalent to AIC (see (4.3.3)). ■

The limitations of the asymptotic theory for the order selection criteria can be seen by considering the criterion obtained by setting $c_T$ equal to $2 \ln \ln T$ in (4.3.6). This results in a criterion

  $C(m) = \ln|\tilde\Sigma_u(m)| + 2m \ln \ln T/T$.   (4.3.15)

¹ I am grateful to Prof. K. Schürger, Universität Bonn, for pointing out the present improvement of the corresponding lemma stated in Lütkepohl (1991).

Under the conditions of Proposition 4.2, it is consistent. Yet, using Lemma 4.1 and the same line of reasoning as in the proof of Proposition 4.3, $\hat p(\mathrm{AIC}) \leq \hat p(C)$ if $2 \ln \ln T \leq 2K^2$ or, equivalently, if $T \leq \exp(\exp K^2)$. For instance, for a bivariate process ($K = 2$), $\exp(\exp K^2) \approx 5.14 \times 10^{23}$. Consequently, if $T \leq 5.14 \times 10^{23}$, the consistent criterion (4.3.15) chooses an order greater than or equal to $\hat p(\mathrm{AIC})$ which in turn has a positive limiting probability for exceeding the true order. This example shows that large sample results sometimes are good approximations only if extreme sample sizes are available. The foregoing result was used by Quinn (1980) as an argument for making $c_T$ a function of the dimension $K$ of the process in the HQ criterion.
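The size of the bound $\exp(\exp K^2)$ is easier to appreciate on a logarithmic scale, which also avoids floating point overflow for larger $K$. The following sketch computes its base-10 logarithm; the function name is introduced here for illustration.

```python
import math


def log10_sample_size_bound(K):
    """Base-10 logarithm of exp(exp(K^2)), computed without forming the number."""
    return math.exp(K**2) / math.log(10.0)


bound = log10_sample_size_bound(2)
exponent = math.floor(bound)
mantissa = 10 ** (bound - exponent)
print("K = 2: about", round(mantissa, 2), "x 10^", exponent)

# For K = 3 the bound already has several thousand decimal digits.
print("K = 3: log10 of the bound is about", round(log10_sample_size_bound(3)))
```

For $K = 2$ this reproduces the order of magnitude $10^{23}$ quoted in the text.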

It is also of interest to compare the order selection criteria to the sequential testing procedure discussed in the previous section. We have mentioned in Section 4.2 that the order chosen in a sequence of tests will depend on the significance levels used. As a consequence, a testing sequence may give the same order as a selection criterion if the significance levels are chosen accordingly. For instance, AIC chooses an order smaller than the maximum order $M$ if $\mathrm{AIC}(M - 1) < \mathrm{AIC}(M)$ or, equivalently, if

  $\lambda_{LR}(1) = T(\ln|\tilde\Sigma_u(M - 1)| - \ln|\tilde\Sigma_u(M)|) < 2MK^2 - 2(M - 1)K^2 = 2K^2$.

For $K = 2$, $2K^2 = 8 \approx \chi^2(4)_{.90}$. Thus, for a bivariate process, in order to ensure that AIC chooses an order less than $M$ whenever the LR testing procedure does, we may use approximately a 10% significance level in the first test of the sequence, provided the distribution of $\lambda_{LR}(1)$ is well approximated by a $\chi^2(4)$-distribution.
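The significance level implied by the AIC threshold $2K^2$ can be computed directly for $K = 2$, because the $\chi^2$ upper tail probability has a simple closed form when the degrees of freedom are even. The sketch below uses that closed form; the function name is an assumption for this illustration, and the approach does not cover odd degrees of freedom.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Pr(X > x) for X ~ chi^2(df) with even df.

    For df = 2k the tail equals exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


K = 2
implied_level = chi2_upper_tail_even_df(2 * K**2, K**2)
print("implied significance level:", round(implied_level, 4))
```

The result is a little above 9%, which is the sense in which the text speaks of approximately a 10% level.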

The sequential testing procedure will not lead to a consistent order estimator if the sequence of individual significance levels is held constant. To see this, note that for $M > p$ and a fixed significance level $\gamma$, the null hypothesis $H_0^1: A_M = 0$ is rejected with probability $\gamma$. In other words, in the testing scheme, $M$ is incorrectly chosen as VAR order with probability $\gamma$. Thus, there is a positive probability of choosing too high an order. This problem can be circumvented by letting the significance level go to zero as $T \to \infty$.


4.3.4  Some  Small  Sample  Simulation  Results 

As  mentioned  previously,  many  of  the  small  sample  properties  of  interest  in  the 
context  of  VAR  order  selection  are  difficult  to  derive  analytically.  Therefore 
we  have  performed  a  small  Monte  Carlo  experiment  to  get  some  feeling  for  the 
small  sample  behavior  of  the  estimators.  Some  results  will  now  be  reported. 

We have simulated 1000 realizations of the VAR(2) process (4.2.1) and we have recorded the orders chosen by FPE, AIC, HQ, and SC for time series lengths of $T = 30$ and 100 and a maximum VAR order of $M = 6$. In addition, we have determined the order by the sequence of LR tests described in Section 4.2 using a significance level of 5% in each individual test and corresponding critical values from $\chi^2$-distributions. That is, we have used $\chi^2$- rather than $F$-tests. The frequency distributions obtained with the five different procedures are displayed in Table 4.6. Obviously, for the sample sizes reported, none of the criteria is very successful in estimating the order $p = 2$ correctly. This may be due to the fact that $A_2$ contains only a single, small nonzero element. The similarity of AIC and FPE derived in (4.3.3) becomes evident for $T = 100$. The orders chosen by the LR testing procedures show that the actual significance levels are quite different from their asymptotic approximations, especially for sample size $T = 30$. If $\lambda_{LR}$ really had a $\chi^2(4)$-distribution the order $\hat p = M = 6$ should be chosen in about 5% of the cases while in the simulation experiment $\hat p = 6$ is chosen for 25.4% of the realizations. Hence, the $\chi^2(4)$-distribution is hardly a good small sample approximation to the actual distribution of $\lambda_{LR}$.
In Table 4.6, we also present the sum of normalized mean squared forecast errors of $y_1$ and $y_2$ obtained from post-sample forecasts with the estimated processes. The quantities shown in the table are

  $\frac{1}{N} \sum_{i=1}^{N} (y_{T+h}(i) - \hat y_T(h)(i))' \Sigma_y(h)^{-1} (y_{T+h}(i) - \hat y_T(h)(i))$,   $h = 1, 2, 3$,

where $N$ is the number of replications, that is, in this case $N = 1000$, $y_{T+h}(i)$ is the realization in the $i$-th repetition and $\hat y_T(h)(i)$ is the corresponding forecast. Normalizing with the inverse of the $h$-step forecast error variance $\Sigma_y(h)$ is useful to standardize the forecast errors in such a way so as to have roughly the same variability and, thus, comparable quantities are averaged. For large sample size $T$ and a large number of replications $N$, the average normalized squared forecast errors should be roughly equal to the dimension of the process, that is, for the present bivariate process they should be close to 2.
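The claim that the normalized average is close to the dimension of the process can be illustrated with artificial forecast errors drawn exactly from $N(0, \Sigma)$. The sketch below does this for a bivariate case; the covariance matrix, the seed, and the function name are assumptions chosen for illustration and are unrelated to the process (4.2.1).

```python
import random


def normalized_average_squared_error(errors, sigma):
    """(1/N) * sum of e' Sigma^{-1} e for bivariate errors e and a 2x2 Sigma."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    total = 0.0
    for e1, e2 in errors:
        total += e1 * (inv[0][0] * e1 + inv[0][1] * e2)
        total += e2 * (inv[1][0] * e1 + inv[1][1] * e2)
    return total / len(errors)


# Hypothetical forecast error covariance matrix and its Cholesky factor.
sigma = ((2.0, 0.6), (0.6, 1.0))
l11 = sigma[0][0] ** 0.5
l21 = sigma[1][0] / l11
l22 = (sigma[1][1] - l21**2) ** 0.5

rng = random.Random(0)
errors = []
for _ in range(20000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    errors.append((l11 * z1, l21 * z1 + l22 * z2))  # e ~ N(0, sigma)

value = normalized_average_squared_error(errors, sigma)
print(round(value, 3))
```

In this idealized setting the average is close to 2. In Table 4.6 the values exceed 2 because the forecasts are based on estimated processes with estimated orders, which adds estimation uncertainty.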

Although in Table 4.6 SC often underestimates the true VAR order $p = 2$, the forecasts obtained with the SC models are generally the best for $T = 30$. The reason is that not restricting the single nonzero coefficient in $A_2$ to zero does not sufficiently improve the forecasts to offset the additional sampling variability introduced by estimating all four elements of the $A_2$ coefficient matrix. For $T = 100$, corresponding forecast MSEs obtained with the different criteria and procedures are very similar, although SC chooses the correct order much less often than the other criteria. This result indicates that choosing the correct VAR order and selecting a good forecasting model are objectives that may be reached by different VAR order selection procedures. Specifically, in this example, slight underestimation of the VAR order is not harmful to the forecast precision. In fact, for $T = 30$, the most parsimonious criterion which underestimates the true VAR order in more than 80% of the realizations of our VAR(2) process provides forecasts with the smallest normalized average squared forecast errors. In fact, the LR tests which choose larger orders quite frequently, produce clearly the worst forecasts.

Table 4.6. Simulation results based on 1000 realizations of the bivariate VAR(2) process (4.2.1)

                          FPE      AIC      HQ       SC       LR

T = 30
VAR order           frequency distributions of estimated VAR orders in %
    0                     0.1      0.1      0.6      2.6      0.1
    1                    46.1     42.0     60.4     81.2     29.8
    2                    33.3     32.2     28.5     14.4     16.5
    3                     8.3      9.0      5.0      1.1      6.5
    4                     3.8      4.1      2.2      0.5      8.1
    5                     3.9      5.0      1.5      0.1     13.6
    6                     4.5      7.6      1.8      0.1     25.4
forecast horizon    normalized average squared forecast errors
    1                    2.63     2.68     2.52     2.37     3.09
    2                    2.66     2.72     2.51     2.41     3.04
    3                    2.58     2.67     2.45     2.35     3.05

T = 100
VAR order           frequency distributions of estimated VAR orders in %
    0                     0.0      0.0      0.0      0.0      0.0
    1                    17.6     17.4     42.7     73.1     20.8
    2                    69.5     69.5     55.5     26.7     53.6
    3                     8.4      8.4      1.7      0.2      5.3
    4                     2.8      2.8      0.1      0.0      6.2
    5                     1.0      1.0      0.0      0.0      5.4
    6                     0.7      0.9      0.0      0.0      8.7
forecast horizon    normalized average squared forecast errors
    1                    2.15     2.15     2.15     2.17     2.22
    2                    2.20     2.20     2.20     2.22     2.25
    3                    2.12     2.12     2.13     2.12     2.17

It must be emphasized, however, that these results are very special and
hold for the single bivariate VAR(2) process used in the simulations. Different
results may be obtained for other processes. To substantiate this statement,
we have also simulated 1000 time series based on the VAR(1) process (4.2.2).
Some results are given in Table 4.7. While for sample size T = 30 again none
of the criteria and procedures is very successful in detecting the correct VAR
order p = 1, all four criteria FPE, AIC, HQ, and SC select the correct order in
more than 90% of the replications for T = 100. The poor approximation of the
small sample distribution of the LR statistic by a $\chi^2(9)$-distribution is evident.
Note that we have used the critical values for 5% level individual tests from
the $\chi^2$-distribution. As in the VAR(2) example, the prediction performance
of the SC models is best for T = 30, although the criterion underestimates
the true order in more than 80% of the replications. For both sample sizes,
the worst forecasts are obtained with the sequential testing procedure which
overestimates the true order quite often.


Table 4.7. Simulation results based on 1000 realizations of the three-dimensional
VAR(1) process (4.2.2)

                      FPE     AIC     HQ      SC      LR

T = 30
VAR order             frequency distributions of estimated VAR orders in %
0                     24.3    17.5    44.5    81.3    0.7
1                     50.7    35.3    39.4    18.0    2.2
2                     7.5     4.7     3.0     0.3     1.3
3                     3.0     2.2     0.9     0.2     1.5
4                     1.8     1.7     0.4     0.0     3.7
5                     2.9     4.2     1.5     0.0     14.9
6                     9.8     34.4    10.3    0.2     75.7
forecast horizon      normalized average squared forecast errors
1                     4.60    6.06    4.43    3.94    8.35
2                     4.12    5.42    3.98    3.33    7.87
3                     3.87    5.11    3.75    3.19    7.49

T = 100
VAR order             frequency distributions of estimated VAR orders in %
0                     0.0     0.0     0.3     8.1     0.0
1                     94.1    93.8    99.6    91.9    61.2
2                     5.0     5.1     0.1     0.0     5.4
3                     0.7     0.7     0.0     0.0     4.3
4                     0.2     0.3     0.0     0.0     7.3
5                     0.0     0.0     0.0     0.0     9.3
6                     0.0     0.1     0.0     0.0     12.5
forecast horizon      normalized average squared forecast errors
1                     3.08    3.08    3.06    3.12    3.24
2                     3.12    3.12    3.11    3.12    3.24
3                     3.11    3.11    3.10    3.10    3.20

After these two simulation experiments, we still do not have a clear
answer to the question which criterion to use in small sample situations. One
conclusion that emerges from the two examples is that, in very small
samples, slight underestimation of the true order is not necessarily harmful to
the forecast precision. Moreover, both examples clearly demonstrate that the
$\chi^2$-approximation to the small sample distribution of the LR statistics is a
poor one. In a simulation study based on many other processes, Lütkepohl
(1985) obtained similar results. In that study, for low order VAR processes,
the most parsimonious SC criterion was found to do quite well in terms of
choosing the correct VAR order and providing good forecasting models.
Unfortunately, in practice we often don't even know whether the underlying data
generation law is of finite order VAR type. Sometimes we may just
approximate an infinite order VAR process by a finite order model. In that case, for
moderate sample sizes, some less parsimonious criterion like AIC may give
superior results in terms of forecast precision. Therefore, it may be a good
strategy to compare the order estimates obtained with different criteria and
possibly perform analyses with different VAR orders.


4.4  Checking  the  Whiteness  of  the  Residuals 

In the previous sections, we have considered procedures for choosing the
order of a VAR model for the generation process of a given multiple time series.
These procedures may be interpreted as methods for determining a filter that
transforms the given data into a white noise series. In this context, the criteria
for model choice may be regarded as criteria for deciding whether the
residuals are close enough to white noise to satisfy the investigator. Of course, if,
for example, forecasting is the objective, it may not be of prime importance
whether the residuals are really white noise as long as the model forecasts well.
There are, however, situations where checking the white noise (whiteness)
assumption for the residuals of a particular model is of interest. For instance, if
the model order is chosen by nonstatistical methods (for example, on the basis
of some economic theory) it may be useful to have statistical tools available
for investigating the properties of the residuals. Moreover, because different
criteria emphasize different aspects of the data generation process and may
therefore all provide useful information for the analyst, it is common not to
rely on just one procedure or criterion for model choice but use a number of
different statistical tools. Therefore, in this section, we shall discuss statistical
tools for checking the autocorrelation properties of the residuals of a given
VAR model.

In  Sections  4.4.1  and  4.4.2,  the  asymptotic  distributions  of  the  residual 
autocovariances  and  autocorrelations  are  given  under  the  assumption  that  the 
model  residuals  are  indeed  white  noise.  In  Sections  4.4.3  and  4.4.4,  two  popular 
statistics  for  checking  the  overall  significance  of  the  residual  autocorrelations 
are  discussed.  The  results  of  this  section  are  adapted  from  Chitturi  (1974), 
Hosking  (1980,  1981a),  Li  &  McLeod  (1981),  and  Ahn  (1988). 


4.4.1  The  Asymptotic  Distributions  of  the  Autocovariances  and 
Autocorrelations  of  a  White  Noise  Process 

It is assumed that $u_t$ is a K-dimensional white noise process with nonsingular
covariance matrix $\Sigma_u$. For instance, $u_t$ may represent the residuals of a
VAR(p) process. Let $U := (u_1, \dots, u_T)$. The autocovariance matrices of $u_t$ are
estimated as

\[
C_i := \hat\Gamma_u(i) := \frac{1}{T} \sum_{t=i+1}^{T} u_t u_{t-i}' = \frac{1}{T} U F_i U', \qquad i = 0, 1, \dots, h < T. \tag{4.4.1}
\]

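The $(T \times T)$ matrix $F_i$ is defined in the obvious way. For instance, for i = 2,
$F_2$ has ones on the second subdiagonal and zeros elsewhere,

\[
F_2 := \begin{bmatrix}
0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & \cdots & 0 & 0 & 0 \\
1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & & \ddots & & & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0
\end{bmatrix} \quad (T \times T).
\]

Of course, for i = 0, $F_0 = I_T$. In the following, the precise form of $F_i$ is not
important. It is useful, though, to remember that $F_i$ is defined such that

\[
U F_i U' = \sum_{t=i+1}^{T} u_t u_{t-i}'.
\]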
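The defining property of $F_i$ can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated Gaussian data; the helper name `shift_matrix` and the dimensions are illustrative choices, not part of the text.

```python
import numpy as np

def shift_matrix(T, i):
    """(T x T) matrix F_i with ones on the i-th subdiagonal (F_0 = I_T)."""
    return np.eye(T, k=-i)

rng = np.random.default_rng(0)
K, T, i = 2, 8, 2                     # illustrative dimensions
U = rng.standard_normal((K, T))       # columns play the role of u_1, ..., u_T

# Left-hand side: U F_i U'
lhs = U @ shift_matrix(T, i) @ U.T
# Right-hand side: sum over t = i+1, ..., T of u_t u_{t-i}' (0-based columns)
rhs = sum(np.outer(U[:, t], U[:, t - i]) for t in range(i, T))

identity_holds = np.allclose(lhs, rhs)
```

Dividing either side by T gives the estimator $C_i$ in (4.4.1).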


Let

\[
\mathbf{C}_h := (C_1, \dots, C_h) = \frac{1}{T} U F (I_h \otimes U'), \tag{4.4.2}
\]

where $F := (F_1, \dots, F_h)$ is a $(T \times hT)$ matrix that is understood to depend
on h and T without this being indicated explicitly. Furthermore, let

\[
c_h := \mathrm{vec}(\mathbf{C}_h). \tag{4.4.3}
\]

The estimated autocorrelation matrices of the $u_t$ are denoted by $R_i$, that is,

\[
R_i := D^{-1} C_i D^{-1}, \qquad i = 0, 1, \dots, h, \tag{4.4.4}
\]

where D is a $(K \times K)$ diagonal matrix, the diagonal elements being the square
roots of the diagonal elements of $C_0$. In other words, a typical element of $R_i$
is

\[
r_{mn,i} = \frac{c_{mn,i}}{\sqrt{c_{mm,0}}\,\sqrt{c_{nn,0}}},
\]

where $c_{mn,i}$ is the mn-th element of $C_i$. The matrix $R_i$ in (4.4.4) is an
estimator of the true autocorrelation matrix $R_u(i) = 0$ for $i \neq 0$. We use the
notation

\[
\mathbf{R}_h := (R_1, \dots, R_h) \quad \text{and} \quad r_h := \mathrm{vec}(\mathbf{R}_h) \tag{4.4.5}
\]

and we denote by $R_u$ the true correlation matrix corresponding to $\Sigma_u$. Now
we can give the asymptotic distributions of $r_h$ and $c_h$.

Proposition 4.4 (Asymptotic Distributions of White Noise Autocovariances
and Autocorrelations)

Let $u_t$ be a K-dimensional identically distributed standard white noise
process, that is, $u_t$ and $u_s$ have the same multivariate distribution with
nonsingular covariance matrix $\Sigma_u$ and corresponding correlation matrix $R_u$. Then,
for $h \geq 1$,

\[
\sqrt{T}\, c_h \xrightarrow{d} \mathcal{N}(0,\, I_h \otimes \Sigma_u \otimes \Sigma_u) \tag{4.4.6}
\]

and

\[
\sqrt{T}\, r_h \xrightarrow{d} \mathcal{N}(0,\, I_h \otimes R_u \otimes R_u). \tag{4.4.7}
\]

Proof: The result (4.4.6) follows from an appropriate central limit theorem.
The i.i.d. assumption for the $u_t$ implies that

\[
w_t = \mathrm{vec}(u_t u_{t-1}', \dots, u_t u_{t-h}')
\]

is a stationary white noise process with covariance matrix
$E(w_t w_t') = I_h \otimes \Sigma_u \otimes \Sigma_u$ so that the result (4.4.6) may, e.g., be obtained from the central
limit theorem for stationary processes given in Proposition C.13 of Appendix
C. Proofs can also be found in Fuller (1976, Chapter 6) and Hannan (1970,
Chapter IV, Section 4) among others.

The result in (4.4.7) is a quite easy consequence of (4.4.6). From
Proposition 3.2, we know that $C_0$ is a consistent estimator of $\Sigma_u$. Hence,

\[
\sqrt{T}\, \mathrm{vec}(R_i) = \sqrt{T}\, (D^{-1} \otimes D^{-1})\, \mathrm{vec}(C_i) \xrightarrow{d} \mathcal{N}(0,\, R_u \otimes R_u)
\]

by Proposition C.15(1) of Appendix C and (4.4.6), because

\[
\mathrm{plim}\,(D^{-1} \otimes D^{-1})(\Sigma_u \otimes \Sigma_u)(D^{-1} \otimes D^{-1})
= \mathrm{plim}\,(D^{-1}\Sigma_u D^{-1} \otimes D^{-1}\Sigma_u D^{-1}) = R_u \otimes R_u.
\]

The result in (4.4.6) means that $\sqrt{T}\,\mathrm{vec}(C_i)$ has the same asymptotic
distribution as $\sqrt{T}\,\mathrm{vec}(C_j)$, namely,

\[
\sqrt{T}\, \mathrm{vec}(C_i),\ \sqrt{T}\, \mathrm{vec}(C_j) \xrightarrow{d} \mathcal{N}(0,\, \Sigma_u \otimes \Sigma_u).
\]

Moreover, for $i \neq j$, the two estimators are asymptotically independent. By
(4.4.7), the same holds for $\sqrt{T}\,\mathrm{vec}(R_i)$ and $\sqrt{T}\,\mathrm{vec}(R_j)$.

In practice, the $u_t$ and hence U will usually be unknown and the reader may
wonder about the relevance of Proposition 4.4. The result is not only useful
in proving other propositions but can also be used to check whether a given
time series is white noise. Before we explain that procedure, we mention that
Proposition 4.4 remains valid if the considered white noise process is allowed
to have nonzero mean and the mean vector is estimated by the sample mean
vector. That is, we consider covariance matrices

\[
C_i = \frac{1}{T} \sum_{t=i+1}^{T} (u_t - \bar{u})(u_{t-i} - \bar{u})',
\]

where

\[
\bar{u} = \frac{1}{T} \sum_{t=1}^{T} u_t
\]

is the sample mean vector.

Next we observe that the diagonal elements of $R_u \otimes R_u$ are all ones.
Consequently, the variances of the asymptotic distributions of the elements of $\sqrt{T}\, r_h$
are all unity. Hence, in large samples the $\sqrt{T}\, r_{mn,i}$ for i > 0 have approximate
standard normal distributions. Denoting by $\rho_{mn}(i)$ the true correlation
coefficients corresponding to the $r_{mn,i}$, a test, with level approximately 5%, of the
null hypothesis

\[
H_0: \rho_{mn}(i) = 0 \quad \text{against} \quad H_1: \rho_{mn}(i) \neq 0
\]

rejects $H_0$ if $|\sqrt{T}\, r_{mn,i}| > 2$ or, equivalently, $|r_{mn,i}| > 2/\sqrt{T}$.
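The test just described can be sketched in a few lines. This is a minimal illustration, assuming NumPy and data arranged as a $(T \times K)$ array; the function names `autocorrelation_matrices` and `significant_lags` are hypothetical helpers introduced here, not routines from the text.

```python
import numpy as np

def autocorrelation_matrices(u, h):
    """Return [R_1, ..., R_h] with R_i = D^{-1} C_i D^{-1}, cf. (4.4.4).

    u is a (T x K) array; the sample mean is subtracted first.
    """
    u = np.asarray(u, dtype=float)
    T = u.shape[0]
    uc = u - u.mean(axis=0)
    C0 = uc.T @ uc / T
    d_inv = 1.0 / np.sqrt(np.diag(C0))            # diagonal of D^{-1}
    R = []
    for i in range(1, h + 1):
        Ci = uc[i:].T @ uc[:-i] / T               # (1/T) sum_t u_t u_{t-i}'
        R.append(d_inv[:, None] * Ci * d_inv[None, :])
    return R

def significant_lags(u, h):
    """List (lag, m, n) with |r_mn,lag| > 2/sqrt(T) (0-based m, n)."""
    u = np.asarray(u, dtype=float)
    bound = 2.0 / np.sqrt(u.shape[0])
    return [(i + 1, m, n)
            for i, Ri in enumerate(autocorrelation_matrices(u, h))
            for m in range(Ri.shape[0])
            for n in range(Ri.shape[1])
            if abs(Ri[m, n]) > bound]
```

Because each comparison is an individual test at roughly the 5% level, a few entries may be flagged even for genuine white noise.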

Now we have a test for checking the null hypothesis that a given multiple
time series is generated by a white noise process. We simply compute the
correlations of the original data (possibly after some stationarity transformation)
and compare their absolute values with $2/\sqrt{T}$. In Section 4.3.2, we found that
the SC and HQ estimate of the order for the generation process of the
investment/income/consumption example data is p = 0. Therefore, one may want
to check the white noise hypothesis for this example. The first two correlation
matrices for the data from 1960.4 to 1978.4 are

\[
R_1 = \begin{bmatrix}
-.197 & .103 & .128 \\
.190 & .020 & .228 \\
-.047 & .150 & -.089
\end{bmatrix}
\quad \text{and} \quad
R_2 = \begin{bmatrix}
-.045 & .067 & .097 \\
.119 & .079 & .009 \\
.255 & .355 & .279
\end{bmatrix}. \tag{4.4.8}
\]

Comparing these quantities with $2/\sqrt{T} = 2/\sqrt{73} = .234$, we find that some
are significantly different from zero and, hence, we reject the white noise
hypothesis on the basis of this test.
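The comparison can be reproduced directly from the numbers in (4.4.8). The sketch below assumes NumPy and simply flags the entries whose absolute value exceeds $2/\sqrt{73}$; the variable names are illustrative.

```python
import numpy as np

# Correlation matrices as printed in (4.4.8)
R1 = np.array([[-.197, .103, .128],
               [ .190, .020, .228],
               [-.047, .150, -.089]])
R2 = np.array([[-.045, .067, .097],
               [ .119, .079, .009],
               [ .255, .355, .279]])

T = 73
bound = 2.0 / np.sqrt(T)

# 0-based (row, column) positions of entries outside the +-2/sqrt(T) band
exceed_lag1 = np.argwhere(np.abs(R1) > bound)
exceed_lag2 = np.argwhere(np.abs(R2) > bound)
```

With these values, no entry of $R_1$ exceeds the bound, while all three entries in the last row of $R_2$ do.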

In applied work, the estimated autocorrelations are sometimes plotted and
$\pm 2/\sqrt{T}$-bounds around zero are indicated. The white noise hypothesis is then
rejected if any of the estimated correlation coefficients reach out of the area
between the $\pm 2/\sqrt{T}$-bounds. In Figure 4.1, plots of some autocorrelations are
provided for the example data. Some autocorrelations at lags 2, 4, 8, 11, and
12 are seen to be significant under the aforementioned criterion.

There are several points that must be kept in mind in such a procedure.
First, in an exact 5% level test, on average the test will reject one out of twenty
times it is performed independently, even if the null hypothesis is correct.
Thus, one would expect that one out of twenty autocorrelation estimates
exceeds $2/\sqrt{T}$ in absolute value even if the underlying process is indeed white
noise. Note, however, that although $R_i$ and $R_j$ are asymptotically independent
for $i \neq j$, the same is not necessarily true for the elements of $R_i$. Thus,
considering the individual correlation coefficients may provide a misleading
picture of their significance as a group. Tests for overall significance of groups
of autocorrelations are discussed in Sections 4.4.3 and 4.4.4.

Second, the tests we have considered here are just asymptotic tests. In
other words, the actual sizes of the tests may differ from their nominal sizes.
In fact, it has been shown by Dufour & Roy (1985) and others that in small
samples the variances of the correlation coefficients may differ considerably
from $1/T$. They will often be smaller so that the tests are conservative in that
they reject the null hypothesis less often than is indicated by the significance
level chosen.

Despite  this  criticism,  this  check  for  whiteness  of  a  time  series  enjoys  much 
popularity  as  it  is  very  easy  to  carry  out.  It  is  a  good  idea,  however,  not  to 
rely  on  this  criterion  exclusively. 


4.4.2  The  Asymptotic  Distributions  of  the  Residual 
Autocovariances  and  Autocorrelations  of  an  Estimated  VAR 
Process 

Theoretical  Results 

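If a VAR(p) model has been fitted to the data, a procedure similar to that
described in the previous subsection is often used to check the whiteness of
the residuals. Instead of the actual $u_t$'s, the estimation residuals are used,
however. We will now consider the consequences of that approach. For that
purpose, we assume that the model has been estimated by LS and, using the
notation of Section 3.2, the coefficient estimator is denoted by $\hat{B}$ and the
corresponding residuals are $\hat{U} = (\hat{u}_1, \dots, \hat{u}_T) := Y - \hat{B}Z$. Furthermore,

\[
\begin{aligned}
\hat{C}_i &:= \frac{1}{T} \hat{U} F_i \hat{U}', \qquad i = 0, 1, \dots, h, \\
\hat{\mathbf{C}}_h &:= (\hat{C}_1, \dots, \hat{C}_h) = \frac{1}{T} \hat{U} F (I_h \otimes \hat{U}'), \\
\hat{c}_h &:= \mathrm{vec}(\hat{\mathbf{C}}_h),
\end{aligned} \tag{4.4.9}
\]

and, by analogy with (4.4.4) and (4.4.5),

\[
\hat{R}_i := \hat{D}^{-1} \hat{C}_i \hat{D}^{-1}, \qquad
\hat{\mathbf{R}}_h := (\hat{R}_1, \dots, \hat{R}_h), \qquad
\hat{r}_h := \mathrm{vec}(\hat{\mathbf{R}}_h), \tag{4.4.10}
\]

where $\hat{D}$ is a diagonal matrix with the square roots of the diagonal elements
of $\hat{C}_0$ on the main diagonal. We will consider the asymptotic distribution of
$\sqrt{T}\,\hat{c}_h$ first. For that purpose the following lemma is helpful.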

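Lemma 4.2

Let $y_t$ be a stationary, stable VAR(p) process as in (4.1.1) with identically
distributed standard white noise $u_t$ and let $\hat{B}$ be a consistent estimator of
$B = [\nu, A_1, \dots, A_p]$ such that $\sqrt{T}\,\mathrm{vec}(\hat{B} - B)$ has an asymptotic normal
distribution. Then $\sqrt{T}\,\hat{c}_h$ has the same asymptotic distribution as

\[
\sqrt{T}\, c_h - \sqrt{T}\, G\, \mathrm{vec}(\hat{B} - B), \tag{4.4.11}
\]

where $G := \tilde{G}' \otimes I_K$ with

\[
\tilde{G} := \begin{bmatrix}
0 & 0 & \cdots & 0 \\
\Sigma_u & \Phi_1 \Sigma_u & \cdots & \Phi_{h-1} \Sigma_u \\
0 & \Sigma_u & \cdots & \Phi_{h-2} \Sigma_u \\
\vdots & & & \vdots \\
0 & 0 & \cdots & \Phi_{h-p} \Sigma_u
\end{bmatrix} \quad ((Kp + 1) \times Kh). \tag{4.4.12}
\]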


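Proof: Using the notation $Y = BZ + U$,

\[
\hat{U} = Y - \hat{B}Z = BZ + U - \hat{B}Z = U - (\hat{B} - B)Z.
\]

Hence,

\[
\begin{aligned}
\hat{U} F (I_h \otimes \hat{U}')
&= U F (I_h \otimes U') - U F \left[ I_h \otimes Z'(\hat{B} - B)' \right] \\
&\quad - (\hat{B} - B) Z F (I_h \otimes U') + (\hat{B} - B) Z F \left[ I_h \otimes Z'(\hat{B} - B)' \right].
\end{aligned} \tag{4.4.13}
\]

Dividing by T and applying the vec operator, this expression becomes $\hat{c}_h$. In
order to obtain the expression in (4.4.11), we consider the terms on the
right-hand side of (4.4.13) in turn. The first term becomes $\sqrt{T}\, c_h$ upon division by
$\sqrt{T}$ and application of the vec operator.

Dividing the second and last terms by $\sqrt{T}$ they can be shown to converge
to zero in probability, that is,

\[
\mathrm{plim}\, \sqrt{T}\, U F \left[ I_h \otimes Z'(\hat{B} - B)' \right] / T = 0 \tag{4.4.14}
\]

and

\[
\mathrm{plim}\, \sqrt{T}\, (\hat{B} - B) Z F \left[ I_h \otimes Z'(\hat{B} - B)' \right] / T = 0 \tag{4.4.15}
\]

(see Problem 4.3). Thus, it remains to show that dividing the third term in
(4.4.13) by $\sqrt{T}$ and applying the vec operator yields an expression which is
asymptotically equivalent to the last term in (4.4.11). To see this, consider

\[
Z F (I_h \otimes U') = (Z F_1 U', \dots, Z F_h U')
\]

and

\[
Z F_i U' = \sum_{t=i+1}^{T} Z_{t-1} u_{t-i}'
= \sum_{t=i+1}^{T}
\begin{bmatrix}
1 \\
\sum_{j=0}^{\infty} \Phi_j u_{t-1-j} \\
\vdots \\
\sum_{j=0}^{\infty} \Phi_j u_{t-p-j}
\end{bmatrix} u_{t-i}',
\]

where the $\Phi_j$ are the coefficient matrices of the canonical MA representation
of $y_t$ (see (2.1.17)). Upon division by T and application of the plim we get

\[
\mathrm{plim}\, \frac{1}{T} Z F_i U' =
\begin{bmatrix}
0 \\
\Phi_{i-1} \Sigma_u \\
\vdots \\
\Phi_{i-p} \Sigma_u
\end{bmatrix},
\]

where $\Phi_j = 0$ for j < 0. Hence,

\[
\mathrm{plim}\, \frac{1}{T} Z F (I_h \otimes U') =
\begin{bmatrix}
0 & 0 & \cdots & 0 \\
\Sigma_u & \Phi_1 \Sigma_u & \cdots & \Phi_{h-1} \Sigma_u \\
0 & \Sigma_u & \cdots & \Phi_{h-2} \Sigma_u \\
\vdots & & & \vdots \\
0 & 0 & \cdots & \Phi_{h-p} \Sigma_u
\end{bmatrix} = \tilde{G} \quad ((Kp + 1) \times Kh).
\]

The lemma follows by noting that

\[
\mathrm{vec}\left[ (\hat{B} - B) Z F (I_h \otimes U') \right] = \left( [Z F (I_h \otimes U')]' \otimes I_K \right) \mathrm{vec}(\hat{B} - B).
\]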

The  next  lemma  is  also  helpful  later. 

Lemma 4.3

If $y_t$ is a stable VAR(p) process as in (4.1.1) with identically distributed
standard white noise, then

\[
\begin{bmatrix}
\frac{1}{\sqrt{T}}\, \mathrm{vec}(UZ') \\
\sqrt{T}\, c_h
\end{bmatrix}
\xrightarrow{d} \mathcal{N}\left( 0,\
\begin{bmatrix}
\Gamma & \tilde{G} \\
\tilde{G}' & I_h \otimes \Sigma_u
\end{bmatrix} \otimes \Sigma_u \right), \tag{4.4.16}
\]

where $\Gamma := \mathrm{plim}\, ZZ'/T$ and $\tilde{G}$ is as defined in (4.4.12). ■

For the two terms $\mathrm{vec}(UZ')/\sqrt{T}$ and $\sqrt{T}\, c_h$ separately, the asymptotic
distributions are already known from Lemma 3.1 and Proposition 4.4,
respectively. So the joint asymptotic distribution is the new result here. The reader
is referred to Ahn (1988) for a proof. Now the asymptotic distribution of the
residual autocovariances is easily obtained.

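Proposition 4.5 (Asymptotic Distributions of Residual Autocovariances)
Let $y_t$ be a stationary, stable, K-dimensional VAR(p) process as in (4.1.1) with
identically distributed standard white noise process $u_t$ and let the coefficients
be estimated by multivariate LS or an asymptotically equivalent procedure.
Then

\[
\sqrt{T}\, \hat{c}_h \xrightarrow{d} \mathcal{N}(0,\, \Sigma_c(h)),
\]

where

\[
\begin{aligned}
\Sigma_c(h) &= (I_h \otimes \Sigma_u - \tilde{G}' \Gamma^{-1} \tilde{G}) \otimes \Sigma_u \\
&= (I_h \otimes \Sigma_u \otimes \Sigma_u) - \bar{G} \left[ \Gamma_Y(0)^{-1} \otimes \Sigma_u \right] \bar{G}'.
\end{aligned} \tag{4.4.17}
\]

Here $\tilde{G}$ and $\Gamma$ are the same matrices as in Lemma 4.3, $\Gamma_Y(0)$ is the covariance
matrix of $Y_t = (y_t', \dots, y_{t-p+1}')'$ and $\bar{G} := \tilde{G}_*' \otimes I_K$, where $\tilde{G}_*$ is a $(Kp \times Kh)$
matrix which has the same form as $\tilde{G}$ except that the first row of zeros is
eliminated. ■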

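Proof: Using Lemma 4.2, $\sqrt{T}\, \hat{c}_h$ is known to have the same asymptotic
distribution as

\[
\begin{aligned}
\sqrt{T}\, c_h - \sqrt{T}\, G\, \mathrm{vec}(\hat{B} - B)
&= \left[ -\tilde{G}' \otimes I_K : I \right]
\begin{bmatrix}
\sqrt{T}\, \mathrm{vec}(\hat{B} - B) \\
\sqrt{T}\, c_h
\end{bmatrix} \\
&= \left[ -\tilde{G}' \otimes I_K : I \right]
\begin{bmatrix}
\left( \frac{ZZ'}{T} \right)^{-1} \otimes I_K & 0 \\
0 & I
\end{bmatrix}
\begin{bmatrix}
\frac{1}{\sqrt{T}}\, \mathrm{vec}(UZ') \\
\sqrt{T}\, c_h
\end{bmatrix}.
\end{aligned}
\]

Noting that $\mathrm{plim}\,(ZZ'/T)^{-1} = \Gamma^{-1}$, the desired result follows from Lemma
4.3 and Proposition C.15(1) of Appendix C because

\[
\begin{aligned}
&\left[ -\tilde{G}' \Gamma^{-1} \otimes I_K : I \right]
\left(
\begin{bmatrix}
\Gamma & \tilde{G} \\
\tilde{G}' & I_h \otimes \Sigma_u
\end{bmatrix} \otimes \Sigma_u
\right)
\begin{bmatrix}
-\Gamma^{-1} \tilde{G} \otimes I_K \\
I
\end{bmatrix} \\
&\quad = (I_h \otimes \Sigma_u - \tilde{G}' \Gamma^{-1} \tilde{G}) \otimes \Sigma_u \\
&\quad = I_h \otimes \Sigma_u \otimes \Sigma_u - (\tilde{G}' \otimes I_K)(\Gamma^{-1} \otimes \Sigma_u)(\tilde{G} \otimes I_K) \\
&\quad = I_h \otimes \Sigma_u \otimes \Sigma_u - \bar{G} \left[ \Gamma_Y(0)^{-1} \otimes \Sigma_u \right] \bar{G}'.
\end{aligned}
\]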


The form (4.4.17) shows that the variances are smaller than (not greater
than) the diagonal elements of $I_h \otimes \Sigma_u \otimes \Sigma_u$. In other words, the variances
of the asymptotic distribution of the white noise autocovariances are greater
than or equal to the corresponding quantities of the estimated residuals. A
similar result can also be shown for the autocorrelations of the estimated
residuals.

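Proposition 4.6 (Asymptotic Distributions of Residual Autocorrelations)
Let D be the $(K \times K)$ diagonal matrix with the square roots of the diagonal
elements of $\Sigma_u$ on the diagonal and define $G_0 := \tilde{G}(I_h \otimes D^{-1})$. Then, under
the conditions of Proposition 4.5,

\[
\sqrt{T}\, \hat{r}_h \xrightarrow{d} \mathcal{N}(0,\, \Sigma_r(h)),
\]

where

\[
\Sigma_r(h) = \left[ (I_h \otimes R_u) - G_0' \Gamma^{-1} G_0 \right] \otimes R_u. \tag{4.4.18}
\]

Specifically,

\[
\sqrt{T}\, \mathrm{vec}(\hat{R}_j) \xrightarrow{d} \mathcal{N}(0,\, \Sigma_r(j)), \qquad j = 1, 2, \dots,
\]

where

\[
\Sigma_r(j) = \left( R_u - D^{-1} \Sigma_u \left[ 0 : \Phi_{j-1}' : \cdots : \Phi_{j-p}' \right] \Gamma^{-1}
\begin{bmatrix}
0 \\
\Phi_{j-1} \\
\vdots \\
\Phi_{j-p}
\end{bmatrix} \Sigma_u D^{-1} \right) \otimes R_u \tag{4.4.19}
\]

with $\Phi_i = 0$ for i < 0.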

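Proof: Noting that

\[
\hat{r}_h = \mathrm{vec}(\hat{\mathbf{R}}_h) = \mathrm{vec}\left[ \hat{D}^{-1} \hat{\mathbf{C}}_h (I_h \otimes \hat{D}^{-1}) \right]
= (I_h \otimes \hat{D}^{-1} \otimes \hat{D}^{-1})\, \hat{c}_h
\]

and $\hat{D}^{-1}$ is a consistent estimator of $D^{-1}$, we get from Proposition 4.5 that
$\sqrt{T}\, \hat{r}_h$ has an asymptotic normal distribution with mean zero and covariance
matrix

\[
\begin{aligned}
&(I_h \otimes D^{-1} \otimes D^{-1}) \left\{ (I_h \otimes \Sigma_u - \tilde{G}' \Gamma^{-1} \tilde{G}) \otimes \Sigma_u \right\} (I_h \otimes D^{-1} \otimes D^{-1}) \\
&\quad = \left[ (I_h \otimes R_u) - G_0' \Gamma^{-1} G_0 \right] \otimes R_u,
\end{aligned}
\]

where $D^{-1} \Sigma_u D^{-1} = R_u$ has been used. ■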

From (4.4.19), it is obvious that the diagonal elements of the asymptotic
covariance matrix are not greater than 1 because a positive semidefinite
matrix is subtracted from $R_u$. Hence, if estimated residual autocorrelations are
used in a white noise test in a similar fashion as the autocorrelations of the
original data, we will get a conservative test that rejects the null hypothesis
less often than is indicated by the significance level, provided the asymptotic
distribution is a correct indicator of the small sample behavior of the test. In
particular, for autocorrelations at small lags the variances will be less than 1,
while the asymptotic variances approach one for elements of $\sqrt{T}\, \hat{R}_j$ with large
j. This conclusion follows because $\Phi_{j-1}, \dots, \Phi_{j-p}$ approach zero as $j \to \infty$.
As a consequence, the matrix subtracted from $R_u$ goes to zero as $j \to \infty$.
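
To see how the variance reduction in (4.4.19) behaves in the simplest case, consider, purely as an illustrative assumption and not an example from the text, a univariate AR(1) process $y_t = \nu + \alpha y_{t-1} + u_t$ with $|\alpha| < 1$, $\mathrm{Var}(u_t) = \sigma^2$ and mean $\mu = E(y_t)$. Then $K = p = 1$, $\Phi_i = \alpha^i$, $R_u = 1$, $D = \sigma$ and $Z_t = (1, y_t)'$, so that:

```latex
\Gamma = \operatorname{plim} \frac{ZZ'}{T}
       = \begin{bmatrix} 1 & \mu \\ \mu & \gamma_0 + \mu^2 \end{bmatrix},
\qquad \gamma_0 = \frac{\sigma^2}{1 - \alpha^2},
\qquad
\Gamma^{-1} = \frac{1}{\gamma_0}
       \begin{bmatrix} \gamma_0 + \mu^2 & -\mu \\ -\mu & 1 \end{bmatrix},
\\[6pt]
[\,0 : \Phi_{j-1}\,]\, \Gamma^{-1} \begin{bmatrix} 0 \\ \Phi_{j-1} \end{bmatrix}
       = \frac{\alpha^{2(j-1)}}{\gamma_0},
\\[6pt]
\Sigma_r(j) = 1 - \sigma \cdot \frac{\alpha^{2(j-1)}}{\gamma_0} \cdot \sigma
            = 1 - \alpha^{2(j-1)} (1 - \alpha^2).
```

For j = 1 this gives the asymptotic variance $\alpha^2$, which can be far below 1 when $|\alpha|$ is small, whereas $\Sigma_r(j) \to 1$ as $j \to \infty$, in line with the discussion above.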

In practice, all unknown quantities are replaced by estimates in order to
obtain standard errors of the residual autocorrelations and tests of specific
hypotheses regarding the autocorrelations. It is perhaps worth noting, though,
that if $\Gamma$ is estimated by $ZZ'/T$, we have to use the ML estimator $\tilde{\Sigma}_u$ for $\Sigma_u$
to ensure positive variances.


An  Illustrative  Example 

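As an example, we consider the VAR(2) model for the
investment/income/consumption system estimated in Section 3.2.3. For j = 1, we get

\[
\hat{R}_1 = \begin{bmatrix}
\underset{(.026)}{.015} & \underset{(.033)}{-.011} & \underset{(.049)}{-.010} \\
\underset{(.026)}{-.007} & \underset{(.033)}{-.002} & \underset{(.049)}{-.068} \\
\underset{(.026)}{-.024} & \underset{(.033)}{-.045} & \underset{(.049)}{-.096}
\end{bmatrix},
\]

where the estimated standard errors are given in parentheses. Obviously, the
standard errors of the elements of $\hat{R}_1$ are much smaller than $1/\sqrt{T} = .117$
which would be obtained if the variances of the elements of $\sqrt{T}\, \hat{R}_1$ were 1. In
contrast, for j = 6, we get

\[
\hat{R}_6 = \begin{bmatrix}
\underset{(.117)}{.053} & \underset{(.116)}{-.008} & \underset{(.117)}{-.062} \\
\underset{(.117)}{.165} & \underset{(.116)}{.030} & \underset{(.117)}{-.051} \\
\underset{(.117)}{.068} & \underset{(.116)}{.026} & \underset{(.117)}{.020}
\end{bmatrix},
\]

where the standard errors are very close to .117.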

In Figure 4.2, we have plotted the residual autocorrelations and twice their
asymptotic standard errors (approximate 95% confidence bounds) around
zero. It is apparent that the confidence bounds grow with increasing lag length.
For a rough check of 5% level significance of autocorrelations at higher lags,
we may use the $\pm 2/\sqrt{T}$-bounds in practice, which is convenient from a
computational viewpoint.


Fig. 4.2. Estimated residual autocorrelations with two-standard error bounds for
the investment/income/consumption VAR(2) model.


There  are  significant  residual  autocorrelations  at  lags  3,  4,  8,  and  11.  While 
the  significant  values  at  lags  3  and  4  may  be  a  reason  for  concern,  one  may  not 
worry  too  much  about  the  higher  order  lags  because  one  may  not  be  willing 
to  fit  a  high  order  model  if  forecasting  is  the  objective.  As  we  have  seen  in 
Section  4.3.4,  slight  underfitting  may  even  improve  the  forecast  performance. 
In  order  to  remove  the  significant  residual  autocorrelations  at  low  lags,  it  may 
help  to  fit  a  VAR(3)  or  VAR(4)  model.  Of  course,  this  conflicts  with  choosing 
the  model  order  on  the  basis  of  the  model  selection  criteria.  Thus,  it  has  to 
be  decided  which  criterion  is  given  priority. 

It may be worth noting that a plot like that in Figure 4.2 may give a
misleading picture of the overall significance of the residual autocorrelations
because they are not asymptotically independent. In particular, at low lags
there will not only be nonzero correlation between the elements of a specific $\hat{R}_j$
but also between $\hat{R}_j$ and $\hat{R}_i$ for $i \neq j$. Therefore, it is desirable to have tests
for overall significance of the residual autocorrelations of a VAR(p) model.
Such tests are discussed in the next subsections.

4.4.3  Portmanteau  Tests 

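The foregoing results may also be used to construct a popular test for the
overall significance of the residual autocorrelations up to lag h. This test is
commonly called the portmanteau test. It is designed for testing

\[
H_0: \mathbf{R}_h = (R_1, \dots, R_h) = 0 \quad \text{against} \quad H_1: \mathbf{R}_h \neq 0. \tag{4.4.20}
\]

The test statistic is

\[
\begin{aligned}
Q_h &:= T \sum_{i=1}^{h} \mathrm{tr}(\hat{R}_i' \hat{R}_u^{-1} \hat{R}_i \hat{R}_u^{-1}) \\
&= T \sum_{i=1}^{h} \mathrm{tr}(\hat{R}_i' \hat{R}_u^{-1} \hat{R}_i \hat{R}_u^{-1} \hat{D}^{-1} \hat{D}) \\
&= T \sum_{i=1}^{h} \mathrm{tr}(\hat{D} \hat{R}_i' \hat{D} \hat{D}^{-1} \hat{R}_u^{-1} \hat{D}^{-1} \hat{D} \hat{R}_i \hat{D} \hat{D}^{-1} \hat{R}_u^{-1} \hat{D}^{-1}) \\
&= T \sum_{i=1}^{h} \mathrm{tr}(\hat{C}_i' \hat{C}_0^{-1} \hat{C}_i \hat{C}_0^{-1}),
\end{aligned} \tag{4.4.21}
\]

where $\hat{R}_u = \hat{D}^{-1} \hat{C}_0 \hat{D}^{-1}$ is the estimated correlation matrix of the residuals.
Obviously, this statistic is very easy to compute from the estimated residuals.
By Proposition 4.5, it has an approximate asymptotic $\chi^2$-distribution.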

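Proposition 4.7 (Approximate Distribution of the Portmanteau Statistic)
Under the conditions of Proposition 4.5, we have, approximately, for large T
and h,

\[
\begin{aligned}
Q_h &= T \sum_{i=1}^{h} \mathrm{tr}(\hat{C}_i' \hat{C}_0^{-1} \hat{C}_i \hat{C}_0^{-1}) \\
&= T\, \mathrm{vec}(\hat{\mathbf{C}}_h)' (I_h \otimes \hat{C}_0^{-1} \otimes \hat{C}_0^{-1})\, \mathrm{vec}(\hat{\mathbf{C}}_h) \approx \chi^2(K^2(h - p)).
\end{aligned} \tag{4.4.22}
\]

■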

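Sketch of the proof: By Proposition C.15(5) of Appendix C, $Q_h$ has the same
asymptotic distribution as

\[
T\, \hat{c}_h' (I_h \otimes \Sigma_u^{-1} \otimes \Sigma_u^{-1})\, \hat{c}_h.
\]

Defining the $(K \times K)$ matrix P such that $PP' = \Sigma_u$ and

\[
\tilde{c}_h := (I_h \otimes P \otimes P)^{-1} \hat{c}_h,
\]

it is easily seen that $Q_h$ has the same asymptotic distribution as $T\, \tilde{c}_h' \tilde{c}_h$.

Hence, by Proposition C.15(6), it suffices to show that $\sqrt{T}\, \tilde{c}_h \xrightarrow{d} \mathcal{N}(0, \Omega)$,
where $\Omega$ is an idempotent matrix of rank $K^2 h - K^2 p$. Because an
approximate limiting $\chi^2$-distribution of $Q_h$ is claimed only, we just show that $\Omega$ is
approximately equal to an idempotent matrix with rank $K^2(h - p)$.

Using Proposition 4.5, we get

\[
\begin{aligned}
\Omega &= (I_h \otimes P^{-1} \otimes P^{-1})\, \Sigma_c(h)\, (I_h \otimes P'^{-1} \otimes P'^{-1}) \\
&= I_{hK^2} - \bar{P} \bar{G} \left[ \Gamma_Y(0)^{-1} \otimes \Sigma_u \right] \bar{G}' \bar{P}',
\end{aligned}
\]

where $\bar{P} = I_h \otimes P^{-1} \otimes P^{-1}$ and $\bar{G}$ is defined in Proposition 4.5. Noting that
the ij-th block of $\Gamma_Y(0)$ is

\[
\mathrm{Cov}(y_{t-i}, y_{t-j}) = \Gamma_y(j - i) = \sum_{n=0}^{\infty} \Phi_{n-i} \Sigma_u \Phi_{n-j}'
\]

with $\Phi_k = 0$ for k < 0, we get approximately,

\[
\begin{aligned}
\Gamma_Y(0) \otimes \Sigma_u^{-1}
&\approx \left[ \sum_{n=1}^{h} \Phi_{n-i} \Sigma_u \Phi_{n-j}' \right]_{i,j = 1, \dots, p} \otimes \Sigma_u^{-1} \\
&= \left[ \sum_{n=1}^{h} \Phi_{n-i} \Sigma_u P'^{-1} P^{-1} \Sigma_u \Phi_{n-j}' \right]_{i,j = 1, \dots, p} \otimes P'^{-1} P^{-1} \\
&= \bar{G}' \bar{P}' \bar{P} \bar{G}.
\end{aligned}
\]

Hence, if h is such that $\Phi_i \approx 0$ for $i > h - p$,

\[
\Omega \approx I_{hK^2} - \bar{P} \bar{G} (\bar{G}' \bar{P}' \bar{P} \bar{G})^{-1} \bar{G}' \bar{P}'.
\]

Thus, $\Omega$ is approximately equal to an idempotent matrix with rank

\[
\mathrm{tr}\left( I_{hK^2} - \bar{P} \bar{G} (\bar{G}' \bar{P}' \bar{P} \bar{G})^{-1} \bar{G}' \bar{P}' \right) = hK^2 - pK^2,
\]

as was to be shown. ■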


Of course, these arguments do not fully prove Proposition 4.7 because
we have not shown that an approximately idempotent matrix $\Omega$ leads to an
approximate $\chi^2$-distribution. To actually obtain the limiting $\chi^2$-distribution,
we have to assume that h goes to infinity with the sample size. Because the
sketch of the proof should suffice to show in what sense the result is
approximate, we do not pursue this issue further and refer the reader to Ahn (1988)
for details. For practical purposes, it is important to remember that the
$\chi^2$-approximation to the distribution of the test statistic may be misleading for
small values of h.

Like in previous sections we have discussed asymptotic distributions in this
section. Not knowing the small sample distribution is clearly a shortcoming
because, in practice, infinite samples are not available. Using Monte Carlo
techniques, it was found by some researchers that in small samples the nominal
size of the portmanteau test tends to be lower than the significance level chosen
(Davies, Triggs & Newbold (1977), Ljung & Box (1978), Hosking (1980)). As
a consequence the test has low power against many alternatives. Therefore it
has been suggested to use the modified test statistic

\[
\bar{Q}_h := T^2 \sum_{i=1}^{h} (T - i)^{-1}\, \mathrm{tr}(\hat{C}_i' \hat{C}_0^{-1} \hat{C}_i \hat{C}_0^{-1}). \tag{4.4.23}
\]

The modification may be regarded as an adjustment for the number of terms
in the sum in

\[
\hat{C}_i = \frac{1}{T} \sum_{t=i+1}^{T} \hat{u}_t \hat{u}_{t-i}'.
\]

For $T \to \infty$, $T/[T^2 (T - i)^{-1}] \to 1$ and, thus, $\bar{Q}_h$ has the same asymptotic
distribution as $Q_h$, that is, approximately in large samples and for large h,

\[
\bar{Q}_h \approx \chi^2(K^2(h - p)). \tag{4.4.24}
\]

For our example model, we obtained $\bar{Q}_{12} = 81.9$. Comparing this value
with $\chi^2(K^2(h - p))_{.95} = \chi^2(90)_{.95} \approx 113$ shows that we cannot reject the
white noise hypothesis for the residuals at a 5% level.
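
Both portmanteau statistics are straightforward to compute from a residual matrix. The following is a minimal sketch, assuming NumPy and LS residuals stored as a $(T \times K)$ array; the function name `portmanteau` is a hypothetical helper, and no mean adjustment is applied, on the assumption that the fitted model contains an intercept.

```python
import numpy as np

def portmanteau(resid, h):
    """Return (Q_h, adjusted Q_h) as in (4.4.21) and (4.4.23).

    resid is a (T x K) array of LS residuals. Both statistics are
    compared with a chi-square distribution with K^2 (h - p) degrees
    of freedom, where p is the order of the fitted VAR.
    """
    u = np.asarray(resid, dtype=float)
    T = u.shape[0]
    C0_inv = np.linalg.inv(u.T @ u / T)
    Q = 0.0
    Q_adj = 0.0
    for i in range(1, h + 1):
        Ci = u[i:].T @ u[:-i] / T                 # (1/T) sum_t u_t u_{t-i}'
        term = np.trace(Ci.T @ C0_inv @ Ci @ C0_inv)
        Q += T * term
        Q_adj += T**2 * term / (T - i)
    return Q, Q_adj
```

Since $T^2/(T - i) > T$ for every $i \geq 1$, the adjusted statistic is never smaller than $Q_h$, which counteracts the tendency to reject too rarely in small samples.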

As mentioned in the introduction to this section, these tests can also be
used in a model selection/order estimation procedure. A sequence of
hypotheses as in (4.2.15) is tested in such a procedure by checking whether the
residuals are white noise. In the following, Lagrange multiplier tests for residual
autocorrelation will be presented.

4.4.4  Lagrange  Multiplier  Tests 

Another way of testing a VAR model for residual autocorrelation is to assume
a VAR model for the error vector, $u_t = D_1u_{t-1} + \cdots + D_hu_{t-h} + v_t$,
where $v_t$ is white noise. It is equal to $u_t$ if there is no residual
autocorrelation. Therefore, we wish to test the pair of hypotheses

$$H_0: D_1 = \cdots = D_h = 0 \quad\text{against}\quad H_1: D_j \neq 0 \text{ for at least one } j \in \{1,\dots,h\}. \tag{4.4.25}$$

In this case, it is convenient to use the LM principle for constructing a test
because we then only need to estimate the restricted model where $u_t = v_t$.
We determine the test statistic with the help of the auxiliary regression model
(see also Appendix C.7)

$$\hat u_t = \nu + A_1y_{t-1} + \cdots + A_py_{t-p} + D_1\hat u_{t-1} + \cdots + D_h\hat u_{t-h} + \varepsilon_t$$

or, for $t = 1,\dots,T$,

$$\hat U = BZ + D\hat{\mathcal U} + \mathcal E,$$

where $D := [D_1 : \cdots : D_h]$ is $(K \times Kh)$, $\hat{\mathcal U} := (I_h \otimes \hat U)F'$
with $F$ as in (4.4.2), $\mathcal E := [\varepsilon_1,\dots,\varepsilon_T]$ is a $(K \times T)$
matrix and the other symbols are defined as before. In particular, the $\hat u_t$
are the residuals from LS estimation of the original VAR(p) model and
$\hat u_t = 0$ for $t \leq 0$. The LS estimator of $[B : D]$ from the auxiliary
model is


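$$[\hat B : \hat D] = \hat U[Z' : \hat{\mathcal U}']\left(\begin{bmatrix} Z \\ \hat{\mathcal U}\end{bmatrix}[Z' : \hat{\mathcal U}']\right)^{-1}
= [\hat UZ' : \hat U\hat{\mathcal U}']\begin{bmatrix} ZZ' & Z\hat{\mathcal U}' \\ \hat{\mathcal U}Z' & \hat{\mathcal U}\hat{\mathcal U}'\end{bmatrix}^{-1}
= [0 : \hat U\hat{\mathcal U}']\begin{bmatrix} ZZ' & Z\hat{\mathcal U}' \\ \hat{\mathcal U}Z' & \hat{\mathcal U}\hat{\mathcal U}'\end{bmatrix}^{-1},$$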

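where $\hat UZ' = 0$ from the first order conditions for computing the LS
estimator has been used. Thus, applying the rules for the partitioned inverse
(Appendix A.10, Rule (2)) gives

$$\hat D = \hat U\hat{\mathcal U}'\big[\hat{\mathcal U}\hat{\mathcal U}' - \hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}'\big]^{-1}. \tag{4.4.26}$$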

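The standard $\chi^2$-statistic for testing $D = 0$ then becomes

$$\lambda_{LM}(h) = \mathrm{vec}(\hat D)'\Big\{\big[\hat{\mathcal U}\hat{\mathcal U}' - \hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}'\big] \otimes \tilde\Sigma_u^{-1}\Big\}\mathrm{vec}(\hat D)$$

$$\phantom{\lambda_{LM}(h)} = \mathrm{vec}(\hat U\hat{\mathcal U}')'\Big\{\big[\hat{\mathcal U}\hat{\mathcal U}' - \hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}'\big]^{-1} \otimes \tilde\Sigma_u^{-1}\Big\}\mathrm{vec}(\hat U\hat{\mathcal U}'),$$

where

$$\mathrm{vec}(\hat D) = \Big\{\big[\hat{\mathcal U}\hat{\mathcal U}' - \hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}'\big]^{-1} \otimes I_K\Big\}\mathrm{vec}(\hat U\hat{\mathcal U}')$$

has been used. Noting that $\hat U\hat{\mathcal U}' = \hat UF(I_h \otimes \hat U')$ shows that
$T^{-1}\mathrm{vec}(\hat U\hat{\mathcal U}') = \hat c_h$.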
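Moreover, from results in Section 4.4.2 we get

$$\mathrm{plim}\,\frac{1}{T}\hat{\mathcal U}\hat{\mathcal U}' = \mathrm{plim}\,\frac{1}{T}(I_h \otimes \hat U)F'F(I_h \otimes \hat U') = I_h \otimes \Sigma_u$$

and

$$\mathrm{plim}\,\frac{1}{T}\hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}' = G'\Gamma^{-1}G$$

(see the proof of Lemma 4.2). Hence,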

$$\hat\Sigma_c(h) = \frac{1}{T}\big[\hat{\mathcal U}\hat{\mathcal U}' - \hat{\mathcal U}Z'(ZZ')^{-1}Z\hat{\mathcal U}'\big] \otimes \tilde\Sigma_u$$

is a consistent estimator of $\Sigma_c(h)$ and, because the foregoing results imply
that

$$\lambda_{LM}(h) = T\hat c_h'\hat\Sigma_c(h)^{-1}\hat c_h,$$

the asymptotic $\chi^2$-distribution of this statistic follows from Propositions 4.5
and C.15(5).

Proposition 4.8 (Asymptotic Distribution of the LM Statistic for Residual
Autocorrelation)

Under the conditions of Proposition 4.5,

$$\lambda_{LM}(h) \xrightarrow{d} \chi^2(hK^2).$$
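The LM statistic can be computed directly from the matrices of the auxiliary model. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the data are a simulated bivariate VAR(1), and the covariance matrix in the Kronecker product is taken to be $\hat U\hat U'/T$. It uses the identity $\mathrm{vec}(M)'(S^{-1} \otimes \Sigma^{-1})\mathrm{vec}(M) = \mathrm{tr}(M'\Sigma^{-1}MS^{-1})$ to avoid forming the Kronecker product.

```python
import numpy as np

def ls_var(y, p):
    """LS fit of a VAR(p) with intercept. y is (K x (T+p)); the first p columns are presample values."""
    K, n = y.shape
    T = n - p
    Y = y[:, p:]
    Z = np.vstack([np.ones((1, T))] + [y[:, p - i:n - i] for i in range(1, p + 1)])
    B = np.linalg.solve(Z @ Z.T, Z @ Y.T).T      # B = Y Z'(Z Z')^{-1}
    return Y - B @ Z, Z                          # residual matrix and regressor matrix

def lagged_residuals(U, h):
    """Stack the residuals lagged 1..h, with zeros for presample residuals; result is (Kh x T)."""
    K, T = U.shape
    blocks = []
    for i in range(1, h + 1):
        lag = np.zeros((K, T))
        lag[:, i:] = U[:, :T - i]
        blocks.append(lag)
    return np.vstack(blocks)

def lm_statistic(U, Z, h):
    """LM statistic for residual autocorrelation up to order h."""
    K, T = U.shape
    Ucal = lagged_residuals(U, h)
    Sigma = U @ U.T / T                          # assumed covariance estimator
    M = U @ Ucal.T
    S = Ucal @ Ucal.T - Ucal @ Z.T @ np.linalg.solve(Z @ Z.T, Z @ Ucal.T)
    return float(np.trace(M.T @ np.linalg.solve(Sigma, M) @ np.linalg.inv(S)))

# Simulated example: stable bivariate VAR(1), so H0 is true.
rng = np.random.default_rng(1)
K, p, T, h = 2, 1, 200, 2
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
y = np.zeros((K, T + p))
for t in range(1, T + p):
    y[:, t] = A1 @ y[:, t - 1] + rng.standard_normal(K)
U_hat, Z = ls_var(y, p)
lm = lm_statistic(U_hat, Z, h)                   # compare with chi^2(h K^2) critical values
```

With this choice of covariance estimator the statistic also equals $T\,(K - \mathrm{tr}(\tilde\Sigma_u^{-1}\tilde\Sigma_\varepsilon))$, where $\tilde\Sigma_\varepsilon$ is the residual covariance matrix of the unrestricted auxiliary regression with divisor $T$, which gives a convenient numerical cross-check.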


The LM test for residual autocorrelation is sometimes called Breusch-Godfrey
test because it was proposed by Breusch (1978) and Godfrey (1978)
(see also Godfrey (1988)). Unfortunately, the $\chi^2$-distribution was found to
be a poor approximation of the actual null distribution of $\lambda_{LM}(h)$ in many
situations (Edgerton & Shukur (1999) and Doornik (1996)). Even a standard
$F$-approximation is unsatisfactory. However, Doornik (1996) finds that the
following statistic derived from considerations in Rao (1973, §8c.5) provides
satisfactory results in small samples, if it is used with critical values from an
$F(hK^2, Ns - \frac{1}{2}K^2h + 1)$-distribution:

$$F_{Rao}(h) = \left[\left(\frac{\det(\tilde\Sigma_u)}{\det(\tilde\Sigma_\varepsilon)}\right)^{1/s} - 1\right]\frac{Ns - \frac{1}{2}K^2h + 1}{K^2h}.$$

Here

$$s = \left(\frac{K^4h^2 - 4}{K^2 + K^2h^2 - 5}\right)^{1/2},$$

$$N = T - Kp - 1 - Kh - \frac{1}{2}(K - Kh + 1),$$

and $\tilde\Sigma_\varepsilon$ is the residual covariance estimator from an unrestricted LS
estimation of the auxiliary model $\hat U = BZ + D\hat{\mathcal U} + \mathcal E$.
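The degrees of freedom of the approximating $F$-distribution follow mechanically from $T$, $K$, $p$ and $h$. The short sketch below evaluates the expressions for $s$ and $N$ for $K = 3$, $p = 2$ and $T = 73$ (the number of quarters in 1960.4-1978.4); the function names are hypothetical, and truncating the denominator degrees of freedom to an integer is an assumption made here for tabulation.

```python
import math

def rao_f_degrees_of_freedom(T, K, p, h):
    """Return (s, df1, df2) for the Rao F-approximation of the LM test."""
    s = math.sqrt((K**4 * h**2 - 4) / (K**2 + K**2 * h**2 - 5))
    N = T - K * p - 1 - K * h - 0.5 * (K - K * h + 1)
    df1 = h * K**2
    df2 = N * s - 0.5 * K**2 * h + 1
    return s, df1, df2

def rao_f_statistic(det_restricted, det_unrestricted, T, K, p, h):
    """F_Rao(h) from the determinants of the restricted and unrestricted residual covariances."""
    s, df1, df2 = rao_f_degrees_of_freedom(T, K, p, h)
    return ((det_restricted / det_unrestricted) ** (1.0 / s) - 1.0) * df2 / df1

dfs = [(rao_f_degrees_of_freedom(73, 3, 2, h)[1],
        math.floor(rao_f_degrees_of_freedom(73, 3, 2, h)[2])) for h in (1, 2, 3, 4)]
# dfs == [(9, 148), (18, 164), (27, 161), (36, 154)]
```

These pairs agree with the approximate distributions reported for the $F_{Rao}(h)$ tests in Table 4.8.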

We have also applied these tests to our example data and give some results
in Table 4.8. It turns out that neither of the tests finds strong evidence for
remaining residual autocorrelation. All p-values exceed 10%. Recall that a
p-value represents the probability of getting a test value greater than the
observed one, if the null hypothesis is true. Therefore, even at a significance
level of 10%, the null hypothesis of no residual autocorrelation cannot be
rejected.

Table 4.8. Autocorrelation tests for investment/income/consumption example
VAR(2) model, estimation period 1960.4-1978.4

test                 h   test value   approximate distribution   p-value
$\lambda_{LM}(h)$    1       6.37     $\chi^2(9)$                 0.70
                     2      15.52     $\chi^2(18)$                0.62
                     3      32.81     $\chi^2(27)$                0.20
                     4      46.60     $\chi^2(36)$                0.11
$F_{Rao}(h)$         1       0.62     $F(9, 148)$                 0.78
                     2       0.76     $F(18, 164)$                0.75
                     3       1.14     $F(27, 161)$                0.30
                     4       1.26     $F(36, 154)$                0.17

In contrast to the portmanteau tests, which should be used for reasonably
large $h$ only, the LM tests are more suitable for small values of $h$. For large
$h$, the degrees of freedom in the auxiliary regression model will be exhausted
and the statistic cannot be computed in the way described in the foregoing.
An LM test for higher order residual autocorrelation may be based on the
auxiliary model


$$\hat u_t = \nu + A_1y_{t-1} + \cdots + A_py_{t-p} + D_h\hat u_{t-h} + \varepsilon_t$$

and on a test

$$H_0: D_h = 0 \quad\text{versus}\quad H_1: D_h \neq 0.$$

The relevant LM statistic can be shown to have an asymptotic
$\chi^2(K^2)$-distribution under $H_0$.


4.5  Testing  for  Nonnormality 

Normality  of  the  underlying  data  generating  process  is  needed,  for  instance, 
in  setting  up  forecast  intervals.  Nonnormal  residuals  can  also  indicate  more 
generally  that  the  model  is  not  a  good  representation  of  the  data  generation 
process  (see  Chapter  16  for  models  for  nonnormal  data).  Therefore,  testing 
this  distributional  assumption  is  desirable.  We  will  present  tests  for  multi¬ 
variate  normality  of  a  white  noise  process  first.  In  Subsection  4.5.2,  it  is  then 
demonstrated  that  the  tests  remain  valid  if  the  true  residuals  are  replaced  by 
the  residuals  of  an  estimated  VAR(p)  process. 

4.5.1  Tests  for  Nonnormality  of  a  Vector  White  Noise  Process 

The tests developed in the following are based on the third and fourth central
moments (skewness and kurtosis) of the normal distribution. If $x$ is a
univariate random variable with standard normal distribution, i.e., $x \sim \mathcal N(0, 1)$,
its third and fourth moments are known to be $E(x^3) = 0$ and $E(x^4) = 3$. Let $u_t$
be a $K$-dimensional Gaussian white noise process with $u_t \sim \mathcal N(\mu_u, \Sigma_u)$ and
let $P$ be a matrix satisfying $PP' = \Sigma_u$. For example, $P$ may be obtained by
a Choleski decomposition of $\Sigma_u$. Then

$$w_t = (w_{1t},\dots,w_{Kt})' := P^{-1}(u_t - \mu_u) \sim \mathcal N(0, I_K).$$


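In other words, the components of $w_t$ are independent standard normal
random variables. Hence,

$$E\begin{bmatrix} w_{1t}^3 \\ \vdots \\ w_{Kt}^3 \end{bmatrix} = 0 \quad\text{and}\quad E\begin{bmatrix} w_{1t}^4 \\ \vdots \\ w_{Kt}^4 \end{bmatrix} = \begin{bmatrix} 3 \\ \vdots \\ 3 \end{bmatrix} =: 3_K. \tag{4.5.1}$$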

This result will be utilized in checking the normality of the white noise process
$u_t$. The idea is to compare the third and fourth moments of the transformed
process with the theoretical values in (4.5.1) obtained for a Gaussian process.
For  the  univariate  case,  the  corresponding  test  is  known  as  the  Jarque-Bera  or 
Lomnicki-Jarque-Bera  test  (see  Jarque  &  Bera  (1987)  and  Lomnicki  (1961)). 

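For constructing the test, we assume to have observations $u_1,\dots,u_T$ and
define

$$\bar u := \frac{1}{T}\sum_{t=1}^{T} u_t, \qquad S_u := \frac{1}{T-1}\sum_{t=1}^{T}(u_t - \bar u)(u_t - \bar u)',$$

and $P_s$ is a matrix for which $P_sP_s' = S_u$ and such that $\mathrm{plim}(P_s - P) = 0$.
Moreover,

$$v_t := (v_{1t},\dots,v_{Kt})' = P_s^{-1}(u_t - \bar u), \quad t = 1,\dots,T,$$

$$b_1 := (b_{11},\dots,b_{K1})' \quad\text{with}\quad b_{k1} = \frac{1}{T}\sum_{t=1}^{T} v_{kt}^3, \quad k = 1,\dots,K, \tag{4.5.2}$$

and

$$b_2 := (b_{12},\dots,b_{K2})' \quad\text{with}\quad b_{k2} = \frac{1}{T}\sum_{t=1}^{T} v_{kt}^4, \quad k = 1,\dots,K. \tag{4.5.3}$$

Thus, $b_1$ and $b_2$ are estimators of the vectors in (4.5.1). In the next
proposition, the asymptotic distribution of $b_1$ and $b_2$ is given.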

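Proposition 4.9 (Asymptotic Distribution of Skewness and Kurtosis)

If $u_t$ is Gaussian white noise with nonsingular covariance matrix $\Sigma_u$ and
expectation $\mu_u$, $u_t \sim \mathcal N(\mu_u, \Sigma_u)$, then

$$\sqrt{T}\begin{bmatrix} b_1 \\ b_2 - 3_K \end{bmatrix} \xrightarrow{d} \mathcal N\left(0, \begin{bmatrix} 6I_K & 0 \\ 0 & 24I_K \end{bmatrix}\right).$$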
In other words, $b_1$ and $b_2$ are asymptotically independent and normally
distributed. The proposition implies that

$$\lambda_s := Tb_1'b_1/6 \xrightarrow{d} \chi^2(K) \tag{4.5.4}$$

and

$$\lambda_k := T(b_2 - 3_K)'(b_2 - 3_K)/24 \xrightarrow{d} \chi^2(K). \tag{4.5.5}$$

The first statistic can be used to test

$$H_0: E\begin{bmatrix} w_{1t}^3 \\ \vdots \\ w_{Kt}^3 \end{bmatrix} = 0 \quad\text{against}\quad H_1: E\begin{bmatrix} w_{1t}^3 \\ \vdots \\ w_{Kt}^3 \end{bmatrix} \neq 0 \tag{4.5.6}$$


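and $\lambda_k$ may be used to test

$$H_0: E\begin{bmatrix} w_{1t}^4 \\ \vdots \\ w_{Kt}^4 \end{bmatrix} = 3_K \quad\text{against}\quad H_1: E\begin{bmatrix} w_{1t}^4 \\ \vdots \\ w_{Kt}^4 \end{bmatrix} \neq 3_K. \tag{4.5.7}$$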


Furthermore,

$$\lambda_{sk} := \lambda_s + \lambda_k \xrightarrow{d} \chi^2(2K), \tag{4.5.8}$$

which may be used for a joint test of the null hypotheses in (4.5.6) and (4.5.7).
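The three statistics are straightforward to compute once the observations have been standardized. The following is a minimal sketch, assuming a Choleski factor is used for $P_s$ and using simulated Gaussian data; the function name is hypothetical.

```python
import numpy as np

def normality_statistics(U):
    """U is (K x T). Returns (lambda_s, lambda_k, lambda_sk) as in (4.5.4), (4.5.5), (4.5.8)."""
    K, T = U.shape
    Uc = U - U.mean(axis=1, keepdims=True)       # u_t - ubar
    S = Uc @ Uc.T / (T - 1)                      # S_u
    P = np.linalg.cholesky(S)                    # lower triangular with P P' = S_u
    V = np.linalg.solve(P, Uc)                   # v_t = P_s^{-1}(u_t - ubar)
    b1 = (V**3).mean(axis=1)                     # third moments, (4.5.2)
    b2 = (V**4).mean(axis=1)                     # fourth moments, (4.5.3)
    lam_s = T * (b1 @ b1) / 6
    lam_k = T * ((b2 - 3) @ (b2 - 3)) / 24
    return lam_s, lam_k, lam_s + lam_k

rng = np.random.default_rng(3)
U = rng.standard_normal((3, 400))
lam_s, lam_k, lam_sk = normality_statistics(U)
# lam_s and lam_k are compared with chi^2(K) critical values, lam_sk with chi^2(2K).
```

Because the data are centered and standardized, the statistics do not change if each component of $u_t$ is shifted or rescaled by a positive constant, in line with the fact that the mean of the process is not restricted.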


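Proof of Proposition 4.9
We state a helpful lemma first.

Lemma 4.4

Let $z_t = (z_{1t},\dots,z_{Kt})'$ be a Gaussian white noise process with mean $\mu_z$ and
covariance matrix $I_K$, i.e., $z_t \sim \mathcal N(\mu_z, I_K)$. Furthermore, let

$$\bar z = (\bar z_1,\dots,\bar z_K)' := \frac{1}{T}\sum_{t=1}^{T} z_t,$$

$b_{1,z}$ a $(K \times 1)$ vector with $k$-th component $b_{k1,z} := \frac{1}{T}\sum_{t=1}^{T}(z_{kt} - \bar z_k)^3$,

and

$b_{2,z}$ a $(K \times 1)$ vector with $k$-th component $b_{k2,z} := \frac{1}{T}\sum_{t=1}^{T}(z_{kt} - \bar z_k)^4$.

Then

$$\sqrt{T}\begin{bmatrix} b_{1,z} \\ b_{2,z} - 3_K \end{bmatrix} \xrightarrow{d} \mathcal N\left(0, \begin{bmatrix} 6I_K & 0 \\ 0 & 24I_K \end{bmatrix}\right). \tag{4.5.9}$$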
The proof of this lemma is easily obtained, for instance, from results of
Gasser (1975). Proposition 4.9 follows by noting that $P_s$ is a consistent
estimator of $P$ (defined such that $PP' = \Sigma_u$) and by defining $z_t = P^{-1}u_t$.
Hence,

$$\sqrt{T}(P_s^{-1} \otimes P_s^{-1} \otimes P_s^{-1})\frac{1}{T}\sum_t (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u) - \sqrt{T}\,\frac{1}{T}\sum_t (z_t - \bar z) \otimes (z_t - \bar z) \otimes (z_t - \bar z)$$

$$= (P_s^{-1} \otimes P_s^{-1} \otimes P_s^{-1} - P^{-1} \otimes P^{-1} \otimes P^{-1})\frac{1}{\sqrt{T}}\sum_t (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u) \xrightarrow{p} 0.$$

An analogous result is obtained for the fourth moments. Consequently,

$$\sqrt{T}\begin{bmatrix} b_1 - b_{1,z} \\ b_2 - b_{2,z} \end{bmatrix} \xrightarrow{p} 0$$

and the proposition follows from Proposition C.2(2) of Appendix C. ■


Remark 1 In Proposition 4.9, the white noise process is not required to have
zero mean. Thus, tests based on $\lambda_s$, $\lambda_k$, or $\lambda_{sk}$ may be applied if the original
observations are generated by a VAR(0) process. ■


Remark 2 It is known that in the univariate case tests based on the skewness
and kurtosis (third and fourth moments) have small sample distributions that
differ substantially from their asymptotic counterparts (see, e.g., White &
MacDonald (1980), Jarque & Bera (1987) and the references given there).
Therefore, tests based on $\lambda_s$, $\lambda_k$, and $\lambda_{sk}$, in conjunction with the asymptotic
$\chi^2$-distributions in (4.5.4), (4.5.5), and (4.5.8), must be interpreted cautiously.
They should be regarded as rough checks of normality only. ■


Remark 3 Tests based on $\lambda_s$, $\lambda_k$, and $\lambda_{sk}$ cannot be expected to possess
power against distributions having the same first four moments as the normal
distribution. Thus, if higher order moment characteristics are of interest,
these tests cannot be recommended. Other tests for multivariate normality
are described by Mardia (1980), Baringhaus & Henze (1988), and others. ■


4.5.2  Tests  for  Nonnormality  of  a  VAR  Process 

A stationary, stable VAR(p) process, say

$$y_t - \mu = A_1(y_{t-1} - \mu) + \cdots + A_p(y_{t-p} - \mu) + u_t, \tag{4.5.10}$$

is Gaussian (normally distributed) if and only if the white noise process $u_t$ is
Gaussian. Therefore, the normality of the $y_t$'s may be checked via the $u_t$'s. In
practice, the $u_t$'s are replaced by estimation residuals. In the following we will
demonstrate that this is of no consequence for the asymptotic distributions of
the $\lambda$ statistics considered in the previous subsection.

The reader may wonder why normality tests are based on the residuals
rather than the original observations $y_t$. The reason is that tests based on the
latter may be less powerful than those based on the estimation residuals. For
the univariate case this point was demonstrated by Lütkepohl & Schneider
(1989). It is also worth recalling that the forecast errors used in the
construction of forecast intervals are weighted sums of the $u_t$'s. Therefore, checking
the normality of these quantities makes sense if the aim is to establish interval
forecasts. The next result states that Proposition 4.9 remains valid if the true
white noise innovations $u_t$ are replaced by estimation residuals.

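Proposition 4.10 (Asymptotic Distribution of Residual Skewness and Kurtosis)

Let $y_t$ be a $K$-dimensional stationary, stable Gaussian VAR(p) process as in
(4.5.10), where $u_t$ is zero mean white noise with nonsingular covariance matrix
$\Sigma_u$ and let $\hat A_1,\dots,\hat A_p$ be consistent and asymptotically normally distributed
estimators of the coefficients based on a sample $y_1,\dots,y_T$ and possibly some
presample values. Define

$$\hat u_t := (y_t - \bar y) - \hat A_1(y_{t-1} - \bar y) - \cdots - \hat A_p(y_{t-p} - \bar y), \quad t = 1,\dots,T,$$

$$\hat\Sigma_u := \frac{1}{T - Kp - 1}\sum_{t=1}^{T} \hat u_t\hat u_t',$$

and let $\hat P$ be a matrix satisfying $\hat P\hat P' = \hat\Sigma_u$ such that $\mathrm{plim}(\hat P - P) = 0$.
Furthermore, define

$$\hat w_t = (\hat w_{1t},\dots,\hat w_{Kt})' := \hat P^{-1}\hat u_t,$$

$$\hat b_1 = (\hat b_{11},\dots,\hat b_{K1})' \quad\text{with}\quad \hat b_{k1} := \frac{1}{T}\sum_{t=1}^{T} \hat w_{kt}^3, \quad k = 1,\dots,K,$$

and

$$\hat b_2 = (\hat b_{12},\dots,\hat b_{K2})' \quad\text{with}\quad \hat b_{k2} := \frac{1}{T}\sum_{t=1}^{T} \hat w_{kt}^4, \quad k = 1,\dots,K.$$

Then

$$\sqrt{T}\begin{bmatrix} \hat b_1 \\ \hat b_2 - 3_K \end{bmatrix} \xrightarrow{d} \mathcal N\left(0, \begin{bmatrix} 6I_K & 0 \\ 0 & 24I_K \end{bmatrix}\right).$$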
Although the proposition is formulated in terms of the mean-adjusted
form  (4.5.10)  of  the  process,  it  also  holds  if  estimation  residuals  from  the 
standard  intercept  form  are  used  instead.  The  parameter  estimators  may  be 
unconstrained  ML  or  LS  estimators.  However,  the  proposition  does  not  require 
this.  In  other  words,  the  proposition  remains  valid  if,  for  instance,  restricted 
LS  or  generalized  LS  estimators  are  used,  as  discussed  in  the  next  chapter. 
The  following  lemma  will  be  helpful  in  proving  Proposition  4.10. 

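Lemma 4.5

Under the conditions of Proposition 4.10,

$$\mathrm{plim}\left[\frac{1}{\sqrt{T}}\sum_{t=1}^{T} \hat u_t \otimes \hat u_t \otimes \hat u_t - \frac{1}{\sqrt{T}}\sum_{t=1}^{T} (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u)\right] = 0 \tag{4.5.11}$$

and

$$\mathrm{plim}\left[\frac{1}{\sqrt{T}}\sum_{t=1}^{T} \hat u_t \otimes \hat u_t \otimes \hat u_t \otimes \hat u_t - \frac{1}{\sqrt{T}}\sum_{t=1}^{T} (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u)\right] = 0. \tag{4.5.12}$$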


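Proof: A proof for the special case of a VAR(1) process $y_t$ is given and the
generalization is left to the reader. Also, we just show the first result. The
second one follows with analogous arguments. For the special VAR(1) case,

$$\hat u_t = (y_t - \bar y) - \hat A_1(y_{t-1} - \bar y) = (u_t - \bar u) + (A_1 - \hat A_1)(y_{t-1} - \bar y) + a_T,$$

where $a_T = A_1(y_T - y_0)/T$. Hence,

$$\frac{1}{\sqrt{T}}\sum_t [\hat u_t \otimes \hat u_t \otimes \hat u_t] = \frac{1}{\sqrt{T}}\sum_t [(u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u)] + d_T,$$

where $d_T$ is a sum of expressions of the type

$$\frac{1}{\sqrt{T}}\sum_t \big[(A_1 - \hat A_1)(y_{t-1} - \bar y) + a_T\big] \otimes (u_t - \bar u) \otimes (u_t - \bar u)$$

$$= \sqrt{T}\big[(A_1 - \hat A_1) \otimes I_{K^2}\big]\frac{1}{T}\sum_t [(y_{t-1} - \bar y) \otimes (u_t - \bar u) \otimes (u_t - \bar u)] + \sqrt{T}a_T \otimes \frac{1}{T}\sum_t [(u_t - \bar u) \otimes (u_t - \bar u)], \tag{4.5.13}$$

that is, $d_T$ consists of sums of Kronecker products involving
$(A_1 - \hat A_1)(y_{t-1} - \bar y)$, $(u_t - \bar u)$, and $a_T$. Therefore, $d_T = o_p(1)$. For
instance, (4.5.13) goes to zero in probability because

$$\mathrm{plim}\,\frac{1}{T}\sum_t (u_t - \bar u) \otimes (u_t - \bar u) \text{ exists and } \sqrt{T}a_T = o_p(1)$$

so that the last term in (4.5.13) vanishes. Moreover, the elements of
$\sqrt{T}(\hat A_1 - A_1)$ converge in distribution and

$$\mathrm{plim}\,\frac{1}{T}\sum_t (y_{t-1} - \bar y) \otimes (u_t - \bar u) \otimes (u_t - \bar u) = 0 \tag{4.5.14}$$

(see Problem 4.4). Hence the first term in (4.5.13) vanishes. ■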

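Proof of Proposition 4.10

By Proposition C.2(2) of Appendix C and Proposition 4.9, it suffices to show
that

$$(\hat P^{-1} \otimes \hat P^{-1} \otimes \hat P^{-1})\frac{1}{\sqrt{T}}\sum_t \hat u_t \otimes \hat u_t \otimes \hat u_t - (P_s^{-1} \otimes P_s^{-1} \otimes P_s^{-1})\frac{1}{\sqrt{T}}\sum_t (u_t - \bar u) \otimes (u_t - \bar u) \otimes (u_t - \bar u) \xrightarrow{p} 0 \tag{4.5.15}$$

and the fourth moments possess a similar property. The result (4.5.15) follows
from Lemma 4.5 by noting that $\hat P$ and $P_s$ are both consistent estimators of $P$
and, for stochastic vectors $h_T, g_T$ and stochastic matrices $H_T, G_T$ with

$$\mathrm{plim}(h_T - g_T) = 0, \quad h_T \xrightarrow{d} h,$$

and

$$\mathrm{plim}\, H_T = \mathrm{plim}\, G_T = H,$$

we get

$$H_Th_T - G_Tg_T = (H_T - H)h_T + H(h_T - g_T) + (H - G_T)g_T \xrightarrow{p} 0.$$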


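Proposition 4.10 implies that

$$\hat\lambda_s := T\hat b_1'\hat b_1/6 \xrightarrow{d} \chi^2(K), \tag{4.5.16}$$

$$\hat\lambda_k := T(\hat b_2 - 3_K)'(\hat b_2 - 3_K)/24 \xrightarrow{d} \chi^2(K), \tag{4.5.17}$$

and

$$\hat\lambda_{sk} := \hat\lambda_s + \hat\lambda_k \xrightarrow{d} \chi^2(2K). \tag{4.5.18}$$

Thus, all three statistics may be used for testing nonnormality.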

As we have seen, the results hold for any matrix satisfying $\hat P\hat P' = \hat\Sigma_u$. For
example, $\hat P$ may be a lower triangular matrix with positive diagonal obtained
by a Choleski decomposition of $\hat\Sigma_u$. Clearly, in this case $\hat P$ is a consistent
estimator of the corresponding matrix $P$ (see Proposition 3.6). Doornik &
Hansen (1994) point out that with this choice the test results will depend on
the ordering of the variables. Therefore they suggest using a matrix based
on the square root of the correlation matrix corresponding to $\hat\Sigma_u$ instead. In
any case, the matrix $\hat P$ is not unique and, hence, the tests will depend to
some extent on its choice. Strictly speaking, if one particular $\hat P$ is found for
which the null hypothesis can be rejected, this result provides evidence against
the normality of the process. Thus, different $\hat P$ matrices could be applied in
principle.
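The dependence on the choice of the standardizing matrix can be illustrated numerically. The sketch below compares a Choleski factor with one possible reading of the correlation-based suggestion, namely $\hat P = D^{1/2}C^{1/2}$, where $D$ holds the variances and $C^{1/2}$ is the symmetric square root of the correlation matrix. This particular construction, the function names, the simulated skewed data and the use of the divisor $T$ in the covariance estimator are all assumptions made for illustration.

```python
import numpy as np

def standardize(U, method="cholesky"):
    """Return W = P^{-1}(u_t - ubar) for a (K x T) matrix U, with P P' = S."""
    K, T = U.shape
    Uc = U - U.mean(axis=1, keepdims=True)
    S = Uc @ Uc.T / T
    if method == "cholesky":
        return np.linalg.solve(np.linalg.cholesky(S), Uc)
    d = np.sqrt(np.diag(S))                      # standard deviations
    C = S / np.outer(d, d)                       # correlation matrix
    eigval, eigvec = np.linalg.eigh(C)
    C_inv_sqrt = eigvec @ np.diag(eigval**-0.5) @ eigvec.T
    return C_inv_sqrt @ (Uc / d[:, None])        # P^{-1} = C^{-1/2} D^{-1/2}

def skewness_statistic(W):
    """lambda_s = T b_1' b_1 / 6 for standardized data W."""
    T = W.shape[1]
    b1 = (W**3).mean(axis=1)
    return T * (b1 @ b1) / 6

# Simulated correlated, skewed (hence nonnormal) data.
rng = np.random.default_rng(5)
L = np.array([[1.0, 0.0, 0.0], [0.6, 1.0, 0.0], [-0.3, 0.4, 1.0]])
U = L @ (rng.exponential(size=(3, 300)) - 1.0)
perm = [2, 0, 1]                                  # a reordering of the variables

chol_original = skewness_statistic(standardize(U, "cholesky"))
chol_permuted = skewness_statistic(standardize(U[perm], "cholesky"))
corr_original = skewness_statistic(standardize(U, "correlation"))
corr_permuted = skewness_statistic(standardize(U[perm], "correlation"))
```

With the symmetric square root, reordering the variables only reorders the components of the standardized vector, so `corr_original` and `corr_permuted` coincide, whereas the two Choleski-based values will generally differ.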

For illustrative purposes we consider our standard investment/income/consumption
example from Section 3.2.3. Using the least squares residuals from
the VAR(2) model with intercepts and a Choleski decomposition of $\hat\Sigma_u$ yields

$$\hat\lambda_s = 3.15 \quad\text{and}\quad \hat\lambda_k = 4.69$$

which are both smaller than $\chi^2(3)_{.90} = 6.25$, the critical value of an asymptotic
10% level test. Also

$$\hat\lambda_{sk} = 7.84 < \chi^2(6)_{.90} = 10.64.$$

Thus, based on these asymptotic tests we cannot reject the null hypothesis of
a Gaussian data generation process.

It  was  pointed  out  by  Kilian  &  Demiroglu  (2000)  that  the  small  sample  dis¬ 
tributions  of  the  test  statistics  may  differ  substantially  from  their  asymptotic 
approximations.  Thus,  the  tests  may  not  be  very  reliable  in  practice.  Kilian 
&  Demiroglu  (2000)  proposed  bootstrap  versions  to  alleviate  the  problem. 


4.6  Tests  for  Structural  Change 

Time  invariance  or  stationarity  of  the  data  generation  process  is  an  important 
condition  that  was  used  in  deriving  the  properties  of  estimators  and  in  com¬ 
puting  forecasts  and  forecast  intervals.  Recall  that  stationarity  is  a  property 
that  ensures  constant  means,  variances,  and  autocovariances  of  the  process 
through time. As we have seen in the investment/income/consumption
example, economic time series often have characteristics that do not conform with
the assumption of stationarity of the underlying data generation process. For
instance, economic time series often have trends or pronounced seasonal
components and time varying variances. While these components can sometimes
be eliminated by simple transformations, there remains another important
source  of  nonstationarity,  namely  events  that  cause  turbulence  in  economic 
systems  in  particular  time  periods.  For  instance,  wars  usually  change  the  eco¬ 
nomic  conditions  in  some  areas  or  countries  markedly.  Also  new  tax  legisla¬ 
tion  may  have  a  major  impact  on  some  economic  variables.  Furthermore,  the 
oil  price  shocks  in  1973/74  and  1979/80  are  events  that  have  caused  drastic 
changes  in  some  variables  (notably  the  price  for  gasoline).  Such  events  may 
be  sources  of  structural  change  in  economic  systems. 

Because  stability  and,  hence,  stationarity  is  an  important  assumption  in 
our  analysis,  it  is  desirable  to  have  tools  for  checking  this  assumed  property  of 
the  data  generation  process.  In  this  section,  we  consider  two  types  of  tests  that 
can  be  used  for  this  purpose.  The  first  set  of  tests  checks  whether  a  change  in 
the  parameters  has  occurred  at  some  point  in  time  by  comparing  the  estimated 
parameters  before  and  after  the  possible  break  date.  These  tests  are  known 
as  Chow  tests.  The  second  set  of  tests  is  based  on  comparing  forecasts  with 
actually  observed  values.  More  precisely,  forecasts  are  made  prior  to  a  period 
of  possible  structural  change  and  are  compared  to  the  values  actually  observed 
during  that  period.  The  stability  or  stationarity  hypothesis  is  rejected  if  the 
forecasts  differ  too  much  from  the  actually  observed  values.  These  tests  are 
presented  in  Sections  4.6.1  and  4.6.2.  Other  tests  will  be  considered  in  later 
chapters. 

4.6.1  Chow  Tests 

Suppose a change in the parameters of the VAR(p) process (4.1.1) is suspected
after period $T_1 < T$. Given a sample $y_1,\dots,y_T$ plus the required presample
values, the model can be set up as follows for estimation purposes:

$$[Y_{(1)} : Y_{(2)}] = [B_1 : B_2]\,Z + [U_{(1)} : U_{(2)}] = BZ + U,$$

where $Y_{(1)} := [y_1,\dots,y_{T_1}]$, $Y_{(2)} := [y_{T_1+1},\dots,y_T]$, $U$ is partitioned
accordingly, $B_1 := [\nu_1, A_{11},\dots,A_{p1}]$ and $B_2 := [\nu_2, A_{12},\dots,A_{p2}]$ are the
$(K \times (Kp+1))$ dimensional parameter matrices associated with the first
($t = 1,\dots,T_1$) and last ($t = T_1+1,\dots,T$) subperiods, respectively,
$B := [B_1 : B_2]$ is $(K \times 2(Kp+1))$ dimensional and

$$Z := \begin{bmatrix} Z_{(1)} & 0 \\ 0 & Z_{(2)} \end{bmatrix}.$$

Here $Z_{(1)} := [Z_0,\dots,Z_{T_1-1}]$ and $Z_{(2)} := [Z_{T_1},\dots,Z_{T-1}]$ with
$Z_t' := (1, y_t',\dots,y_{t-p+1}')$, as usual.

In this model setup, a test for parameter constancy checks

$$H_0: B_1 = B_2 \text{ or } [I : -I]\,\mathrm{vec}(B) = 0 \quad\text{versus}\quad H_1: B_1 \neq B_2.$$

Clearly, this is just a linear hypothesis which can be handled easily within
our LS or ML framework under standard assumptions. For example, the LS
estimator of $B$ is

$$\hat B = [Y_{(1)} : Y_{(2)}]\,Z'(ZZ')^{-1} = \big[Y_{(1)}Z_{(1)}'(Z_{(1)}Z_{(1)}')^{-1} : Y_{(2)}Z_{(2)}'(Z_{(2)}Z_{(2)}')^{-1}\big].$$

It has an asymptotic normal distribution under the assumptions of Proposition
3.1. To appeal to that proposition it has to be ensured, however, that $T^{-1}ZZ'$
converges in probability to a nonsingular matrix. In other words,

$$\mathrm{plim}\,\frac{1}{T}Z_{(i)}Z_{(i)}', \quad i = 1, 2,$$

has to exist and be nonsingular. Hence, neither $T_1/T$ nor $(T - T_1)/T$ may go
to zero when $T$ goes to $\infty$, so that both subperiods before and after the break
must be assumed to increase with $T$. If the assumptions for asymptotic
normality can be justified, a Wald test can, for example, be used to test the
stability hypothesis. Alternatively, an LR or quasi LR test may be applied.
This type of test is often given the label Chow test in the literature.
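To make the setup concrete, the following sketch computes one LR-type version of the test. It is an illustration under explicit assumptions rather than the specific statistic used in the literature cited here: the white noise covariance matrix is taken to be the same in both subperiods, the statistic is $T(\log\det\tilde\Sigma_{restricted} - \log\det\tilde\Sigma_{unrestricted})$ with residual covariance matrices using the divisor $T$, the function names are hypothetical, and the data are simulated.

```python
import numpy as np

def var_regressors(y, p):
    """y is (K x (T+p)) with p presample columns; returns Y (K x T) and Z ((Kp+1) x T)."""
    K, n = y.shape
    T = n - p
    Y = y[:, p:]
    Z = np.vstack([np.ones((1, T))] + [y[:, p - i:n - i] for i in range(1, p + 1)])
    return Y, Z

def ls_residuals(Y, Z):
    B = np.linalg.solve(Z @ Z.T, Z @ Y.T).T      # B = Y Z'(Z Z')^{-1}
    return Y - B @ Z

def chow_lr_statistic(y, p, T1):
    """LR-type statistic for H0: B_1 = B_2 with a break after period T1, and its degrees of freedom."""
    Y, Z = var_regressors(y, p)
    K, T = Y.shape
    U_r = ls_residuals(Y, Z)                                   # restricted: one parameter matrix
    U_u = np.hstack([ls_residuals(Y[:, :T1], Z[:, :T1]),       # unrestricted: separate LS
                     ls_residuals(Y[:, T1:], Z[:, T1:])])      # in each subperiod
    _, logdet_r = np.linalg.slogdet(U_r @ U_r.T / T)
    _, logdet_u = np.linalg.slogdet(U_u @ U_u.T / T)
    return T * (logdet_r - logdet_u), K * (K * p + 1)

# Simulated stable 3-dimensional VAR(2) without a break.
rng = np.random.default_rng(7)
K, p, T, T1 = 3, 2, 100, 50
A1, A2 = 0.4 * np.eye(K), 0.1 * np.eye(K)
y = np.zeros((K, T + p))
for t in range(p, T + p):
    y[:, t] = A1 @ y[:, t - 1] + A2 @ y[:, t - 2] + rng.standard_normal(K)
stat, df = chow_lr_statistic(y, p, T1)           # compare with chi^2(df) critical values
```

For $K = 3$ and $p = 2$ the number of restrictions is $K(Kp+1) = 21$. Estimating the two subperiods separately reproduces the LS estimator obtained with the block-diagonal regressor matrix $Z$.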

There  are  some  practical  matters  in  applying  these  tests  in  the  present 
context that are worth noting. If the possible break date is very close to the
sample beginning or the sample end, the LS/ML estimators of $B_i$ may not
be available due to lack of degrees of freedom. While at the sample beginning
one  may  be  ready  to  delete  a  few  observations  to  eliminate  the  structural 
break,  this  option  is  often  undesirable  at  the  end  of  the  sample.  For  example, 
if  forecasting  is  the  objective  of  the  analysis,  a  break  towards  the  end  of  the 
sample  would  clearly  be  problematic.  Therefore,  the  so-called  Chow  forecast 
tests  have  been  proposed  which  also  work  for  break  dates  close  to  the  sample 
end.  In  the  next  subsection,  we  present  a  slightly  different  set  of  forecast  tests 
which  may  be  applied  instead. 

Even  if  the  suspected  break  point  is  well  inside  the  sample  period  so  that 
the  application  of  the  standard  Chow  test  is  unproblematic  in  principle,  in 
practice,  the  break  may  not  occur  in  one  period.  If  there  is  a  longer  time 
phase  in  which  a  parameter  shift  to  a  new  level  takes  place,  it  may  be  use¬ 
ful  to  eliminate  a  few  observations  around  the  break  date  and  use  only  the 
remaining ones in estimating the parameters. One may also argue that using
some observations from periods up to $T_1$ in $Z_{(2)}$ may be problematic and
may result in reduced power because observations from both subperiods are
mixed in estimating $B_2$. Under the null hypothesis of parameter constancy,
this should be no problem, however, because, under $H_0$, the same process is
in operation before and after $T_1$. Still, from the point of view of maximizing
power,  deleting  some  observations  around  the  possible  break  point  may  be  a 
good  idea. 

Other  practical  problems  may  result  from  multiple  structural  breaks  within 
the  sample  period.  In  principle,  it  is  no  problem  to  test  multiple  break  points 
simultaneously.  Also,  to  improve  power,  one  may  only  test  some  of  the  pa¬ 
rameters  or  one  may  wish  to  test  for  a  changing  white  noise  covariance  matrix 
which  is  implicitly  assumed  to  be  time  invariant  in  the  foregoing  discussion. 
Details  of  such  extensions  will  be  discussed  in  Chapter  17. 
So far we have considered asymptotic results only. Unfortunately,
Candelon & Lütkepohl (2001) found that asymptotic theory may be an
extremely poor guide for the small sample properties of Chow tests, in
particular, if models with many parameters are under consideration. To improve the
reliability of the tests, these authors proposed using bootstrapped versions.
Bootstrapped p-values may be obtained as described in Appendix D.3.

For the German investment/income/consumption data we have fitted a
VAR(2) model to data up to 1982.4 and we have performed a Chow test for
a break in period 1979.1. The test value is 30.5. Because this exceeds 29.6,
the 10% critical value of a $\chi^2(21)$ distribution, stability is rejected at the
10% level. A bootstrapped p-value based on 2000 bootstrap replications turns
out to be 0.21, however. Thus, based on the bootstrapped test, stability is not
rejected. It is typical for the test based on the asymptotic $\chi^2$-distribution that
it rejects more often in small samples than the specified nominal significance
level, even if the model is stable. This distortion is at least partly corrected
by the bootstrap.

4.6.2 Forecast Tests for Structural Change

A Test Statistic Based on One Forecast Period

Suppose $y_t$ is a $K$-dimensional stationary, stable Gaussian VAR(p) process as
in (4.1.1). The optimal $h$-step forecast at time $T$ is denoted by $y_T(h)$ and the
corresponding forecast error is

$$e_T(h) := y_{T+h} - y_T(h) = \sum_{i=0}^{h-1} \Phi_iu_{T+h-i} = [\Phi_{h-1} : \cdots : \Phi_1 : I_K]\,\mathbf u_{T,h} \tag{4.6.1}$$

where $\mathbf u_{T,h} := (u_{T+1}',\dots,u_{T+h}')'$ and the $\Phi_i$ are the coefficient matrices of the
canonical MA representation (see Section 2.2.2). Because
$\mathbf u_{T,h} \sim \mathcal N(0, I_h \otimes \Sigma_u)$, the forecast error is a linear transformation of a
multivariate normal distribution and, consequently (see Appendix B),

$$e_T(h) \sim \mathcal N(0, \Sigma_y(h)), \tag{4.6.2}$$

where

$$\Sigma_y(h) = \sum_{i=0}^{h-1} \Phi_i\Sigma_u\Phi_i'$$

is the forecast MSE matrix (see (2.2.11)). Hence,

$$\tau_h := e_T(h)'\Sigma_y(h)^{-1}e_T(h) \sim \chi^2(K) \tag{4.6.3}$$

by Proposition B.3 of Appendix B.
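The quantities in (4.6.1)-(4.6.3) can be computed as sketched below. The MA coefficient matrices are obtained from the standard recursion $\Phi_0 = I_K$, $\Phi_i = \sum_{j=1}^{\min(i,p)} \Phi_{i-j}A_j$; the function names and the numerical values are hypothetical and chosen for illustration only.

```python
import numpy as np

def ma_coefficients(A, h):
    """Phi_0, ..., Phi_{h-1} for a VAR with coefficient matrices A = [A_1, ..., A_p]."""
    K, p = A[0].shape[0], len(A)
    Phi = [np.eye(K)]
    for i in range(1, h):
        Phi.append(sum(Phi[i - j] @ A[j - 1] for j in range(1, min(i, p) + 1)))
    return Phi

def forecast_mse(A, Sigma_u, h):
    """Sigma_y(h) = sum_{i=0}^{h-1} Phi_i Sigma_u Phi_i'."""
    return sum(Ph @ Sigma_u @ Ph.T for Ph in ma_coefficients(A, h))

def forecast_test_statistic(e, A, Sigma_u, h):
    """tau_h = e' Sigma_y(h)^{-1} e for an h-step forecast error e, as in (4.6.3)."""
    return float(e @ np.linalg.solve(forecast_mse(A, Sigma_u, h), e))

# Illustrative bivariate VAR(1) with known coefficients.
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
Sigma_u = np.eye(2)
e = np.array([1.0, 2.0])                       # a hypothetical 1-step forecast error
tau_1 = forecast_test_statistic(e, [A1], Sigma_u, h=1)
# For h = 1, Sigma_y(1) = Sigma_u, so tau_1 = e'e = 5.0 here.
```

The value of the statistic is then compared with a critical value of the $\chi^2(K)$-distribution.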
This derivation assumes that $y_{T+h}$ is generated by the same VAR(p) process
that has generated the $y_t$ for $t \leq T$. If this process does not prevail in
period $T+h$, the statistic $\tau_h$ will, in general, not have a central $\chi^2$-distribution.
Hence, $\tau_h$ may be used to test the null hypothesis

$H_0$: (4.6.2) is true, that is, $y_{T+h}$ is generated by the same Gaussian VAR(p)
process that has generated $y_1,\dots,y_T$.

The alternative hypothesis is that $y_{T+h}$ is not generated by the same process
as $y_1,\dots,y_T$. The null hypothesis is rejected if the forecast errors are large
so that $\tau_h$ exceeds a prespecified critical value from the $\chi^2(K)$-distribution.
Such a test may be performed for $h = 1, 2, \dots$.

It may be worth noting that in these tests we also check the normality
assumption for $y_t$. Even if the same process has generated $y_{T+h}$ and $y_1,\dots,y_T$,
(4.6.2) will not hold if that process is not Gaussian. Thus, the normality
assumption for $y_t$ is part of $H_0$. Other possible deviations from the null
hypothesis include changes in the mean and changes in the variance of the process.

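In practice, the tests are not feasible in their present form because $\tau_h$
involves unknown quantities. The forecast errors $e_T(h)$ and the MSE matrix
$\Sigma_y(h)$ are both unknown and must be replaced by estimators. For the forecast
errors, we use

$$\hat e_T(h) := y_{T+h} - \hat y_T(h) = \sum_{i=0}^{h-1} \hat\Phi_i\hat u_{T+h-i}, \tag{4.6.4}$$

where the $\hat\Phi_i$ are obtained from the coefficient estimators $\hat A_i$ in the usual way
(see Section 3.5.2) and

$$\hat u_t := y_t - \hat\nu - \hat A_1y_{t-1} - \cdots - \hat A_py_{t-p}.$$

The MSE matrix may be estimated by

$$\hat\Sigma_y(h) := \sum_{i=0}^{h-1} \hat\Phi_i\hat\Sigma_u\hat\Phi_i', \tag{4.6.5}$$

where $\hat\Sigma_u$ is the LS estimator of $\Sigma_u$. As usual, we use only data up to period
$T$ for estimation and not the data from the forecast period. If the conditions
for consistency of the estimators are satisfied, that is,

$$\mathrm{plim}\,\hat\nu = \nu, \quad \mathrm{plim}\,\hat A_i = A_i,\ i = 1,\dots,p, \quad\text{and}\quad \mathrm{plim}\,\hat\Sigma_u = \Sigma_u,$$

then $\mathrm{plim}\,\hat\Phi_i = \Phi_i$, $\mathrm{plim}\,\hat\Sigma_y(h) = \Sigma_y(h)$ and

$$\mathrm{plim}(\hat u_t - u_t) = \mathrm{plim}(\nu - \hat\nu) + \mathrm{plim}(A_1 - \hat A_1)y_{t-1} + \cdots + \mathrm{plim}(A_p - \hat A_p)y_{t-p} = 0.$$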
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Hence,  defining 

Tk  :=eT(hySy(h)eT(h), 

we  get  plim  {rh  —  rh)  =  0  and,  thus,  by  Proposition  C.2(2)  of  Appendix  C, 
?hSx2(K).  (4.6.6) 

In  other  words,  if  the  unknown  coefficients  are  replaced  by  consistent  estima¬ 
tors,  the  resulting  test  statistics  T>t  have  the  same  asymptotic  distributions  as 
the  rh. 
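
The computation of the feasible statistic $\hat\tau_h$ can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not code from the text: the function names `ma_matrices`, `forecast`, and `tau_hat` are hypothetical, the estimates $\hat\nu$, $\hat A_1, \dots, \hat A_p$ and $\hat\Sigma_u$ are assumed to be supplied by the user, and no small-sample correction is applied.

```python
import numpy as np

def ma_matrices(A, h):
    """Return [Phi_0, ..., Phi_{h-1}] for VAR coefficients A = [A_1, ..., A_p]."""
    K, p = A[0].shape[0], len(A)
    Phi = [np.eye(K)]
    for i in range(1, h):
        # Phi_i = sum_{j=1}^{min(i, p)} Phi_{i-j} A_j
        Phi.append(sum(Phi[i - j] @ A[j - 1] for j in range(1, min(i, p) + 1)))
    return Phi

def forecast(nu, A, y_hist, h):
    """h-step point forecast from origin T; y_hist holds y_1..y_T as rows."""
    p = len(A)
    path = [row for row in y_hist[-p:]]          # last p observations, oldest first
    for _ in range(h):
        # A[j] is A_{j+1}; path[-1 - j] is the (j+1)-th most recent value
        path.append(nu + sum(A[j] @ path[-1 - j] for j in range(p)))
    return path[-1]

def tau_hat(nu, A, sigma_u, y_hist, y_T_plus_h, h):
    """Quadratic form e' Sigma_y(h)^{-1} e, to be compared with chi^2(K)."""
    Phi = ma_matrices(A, h)
    sigma_y = sum(P @ sigma_u @ P.T for P in Phi)   # MSE matrix as in (4.6.5)
    e = y_T_plus_h - forecast(nu, A, y_hist, h)     # forecast error
    return float(e @ np.linalg.solve(sigma_y, e))
```

A large value relative to a $\chi^2(K)$ critical value would speak against $H_0$; the critical value itself has to be taken from a table or a statistics library.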

Of course, it is desirable to know whether the $\chi^2(K)$-distribution is a good
approximation to the distribution of $\hat\tau_h$ in small samples. This, however, is not
likely because in Section 3.5.1,

$$\Sigma_{\hat y}(h) = \Sigma_y(h) + \frac{1}{T}\,\Omega(h) \tag{4.6.7}$$

was found to be a better approximation to the MSE matrix than $\Sigma_y(h)$, if
the forecasts are based on an estimated process. While asymptotically, as
$T \to \infty$, the term $\Omega(h)/T$ vanishes, it seems plausible to include this term
in small samples. For univariate processes, it was confirmed in a simulation
study by Lütkepohl (1988b) that inclusion of the term results in a better
agreement between the small sample and asymptotic distributions. For multivariate
vector processes, the simulation results of Section 3.5.4 point in the
same direction. Thus, in small samples a statistic of the type

$$\hat e_T(h)' \hat\Sigma_{\hat y}(h)^{-1} \hat e_T(h)$$

is more plausible than $\hat\tau_h$. Here $\hat\Sigma_{\hat y}(h)$ is the estimator given in Section 3.5.2.
In addition to this adjustment, it is useful to adjust the statistic for using
an estimated rather than known forecast error covariance matrix. Such an
adjustment is often done by dividing by the degrees of freedom and using the
statistic in conjunction with critical values from an F-distribution. That is,
we may use

$$\bar\tau_h := \hat e_T(h)' \hat\Sigma_{\hat y}(h)^{-1} \hat e_T(h) / K \approx F(K, T - Kp - 1). \tag{4.6.8}$$

The approximate F-distribution follows from Proposition C.3(2) of Appendix
C and the denominator degrees of freedom are chosen by analogy with a
result due to Hotelling (e.g., Anderson (1984)). Other choices are possible.
Proposition C.3(2) requires, however, that the denominator degrees of freedom
go to infinity with the sample size T.

A  Test  Based  on  Several  Forecast  Periods 

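Another set of stationarity tests is obtained by observing that the errors of
forecasts 1- to h-steps ahead are also jointly normally distributed under the
null hypothesis of structural stability,

$$\mathbf{e}_T(h) := \begin{bmatrix} e_T(1) \\ \vdots \\ e_T(h) \end{bmatrix} = \boldsymbol{\Phi}_h \mathbf{u}_{T,h} \sim \mathcal{N}(0, \boldsymbol{\Sigma}_y(h)), \tag{4.6.9}$$

where

$$\mathbf{u}_{T,h} := \begin{bmatrix} u_{T+1} \\ \vdots \\ u_{T+h} \end{bmatrix}, \qquad \boldsymbol{\Phi}_h := \begin{bmatrix} I_K & 0 & \cdots & 0 \\ \Phi_1 & I_K & & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \Phi_{h-1} & \Phi_{h-2} & \cdots & I_K \end{bmatrix}, \tag{4.6.10}$$

so that

$$\boldsymbol{\Sigma}_y(h) = \boldsymbol{\Phi}_h (I_h \otimes \Sigma_u) \boldsymbol{\Phi}_h'. \tag{4.6.11}$$

Using again Proposition B.3 of Appendix B,

$$\lambda_h := \mathbf{e}_T(h)' \boldsymbol{\Sigma}_y(h)^{-1} \mathbf{e}_T(h) = \mathbf{u}_{T,h}' (I_h \otimes \Sigma_u^{-1}) \mathbf{u}_{T,h} = \sum_{i=1}^{h} u_{T+i}' \Sigma_u^{-1} u_{T+i} = \lambda_{h-1} + u_{T+h}' \Sigma_u^{-1} u_{T+h} \sim \chi^2(hK). \tag{4.6.12}$$

Thus, $\lambda_h$ may be used to check whether a structural change has occurred
during the periods T+1, ..., T+h.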

To make this test feasible, it is necessary to replace unknown quantities
by estimators just as in the case of the $\tau$-tests. Denoting the test statistics
based on estimated VAR processes by $\hat\lambda_h$,

$$\hat\lambda_h \xrightarrow{d} \chi^2(hK) \tag{4.6.13}$$

follows with the same arguments used for $\hat\tau_h$, provided consistent parameter
estimators are used.

Again it seems plausible to make small sample adjustments to the statistics
to take into account the fact that estimated quantities are used. The last
expression in (4.6.12) suggests that a closer look at the terms

$$u_{T+i}' \Sigma_u^{-1} u_{T+i} \tag{4.6.14}$$

is useful in searching for a small sample adjustment. This expression involves
the 1-step ahead forecast errors $u_{T+i} = y_{T+i} - y_{T+i-1}(1)$. If estimated
coefficients are used in the 1-step ahead forecast, the MSE or forecast error
covariance matrix is approximately inflated by a factor (T + Kp + 1)/T (see
(3.5.13)). Because $\lambda_h$ is the sum of terms of the form (4.6.14), it may be
suitable to replace $\Sigma_u$ by $(T + Kp + 1)\hat\Sigma_u/T$ when estimated quantities are
used. Note, however, that such an adjustment ignores possible dependencies
between the estimated $\hat u_{T+i}$ and $\hat u_{T+j}$. Nevertheless, it leads to a
computationally extremely simple form and was therefore proposed in the literature
(Lütkepohl (1989b)). Furthermore, it was suggested to divide by the degrees
of freedom of the asymptotic $\chi^2$-distribution and, by appeal to Proposition
C.3(2) of Appendix C, use the resulting statistic, $\bar\lambda_h$ say, in conjunction with
critical values from F-distributions to adjust for the fact that $\Sigma_u$ is replaced
by an estimator. In other words,

$$\bar\lambda_h := T \sum_{i=1}^{h} \hat u_{T+i}' \hat\Sigma_u^{-1} \hat u_{T+i} \big/ [(T + Kp + 1)Kh] \approx F(Kh, T - Kp - 1). \tag{4.6.15}$$

The denominator degrees of freedom are chosen by the same arguments used
in (4.6.8). Obviously, $\bar\lambda_1 = \bar\tau_1$.
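
Because (4.6.15) only involves 1-step ahead forecast errors, it is easy to compute. The following NumPy sketch illustrates this; the function name `lambda_bar` and its argument layout are assumptions made here for illustration, and the forecast errors $\hat u_{T+i}$ and the estimate $\hat\Sigma_u$ are taken as given.

```python
import numpy as np

def lambda_bar(u_hat, sigma_u_hat, T, p):
    """Small-sample statistic (4.6.15).

    u_hat       : (h x K) array, row i holds the 1-step error for period T+i+1
    sigma_u_hat : (K x K) estimate of the white noise covariance matrix
    T, p        : estimation sample size and VAR order
    Returns the statistic and the (numerator, denominator) degrees of freedom.
    """
    h, K = u_hat.shape
    # sum_i u_{T+i}' Sigma_u^{-1} u_{T+i}
    solved = np.linalg.solve(sigma_u_hat, u_hat.T).T      # rows: Sigma_u^{-1} u
    quad = float(np.einsum("ik,ik->", u_hat, solved))
    stat = T * quad / ((T + K * p + 1) * K * h)
    return stat, (K * h, T - K * p - 1)
```

The returned degrees of freedom are those of the approximating F-distribution; a p-value would be obtained from any F-distribution routine.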

Now we have different sets of stationarity tests and the question arises
which ones to use in practice. To answer this question, it would be useful
to know the power characteristics of the tests because it is desirable to use
the most powerful test available. For some alternatives the $\tau$- and $\lambda$-statistics
have noncentral $\chi^2$-distributions (Lütkepohl (1988b, 1989)). In these cases
it is possible to investigate and compare their powers. It turns out that for
some alternatives the $\tau$-tests are more powerful than the $\lambda$-tests and for other
alternatives the opposite is true. Because we usually do not know the exact
form of the alternative (the exact form of the structural change) it may be a
good idea to apply both tests in practice. In addition, a Chow test may be
used.


An  Example 

To illustrate the use of the two tests for stationarity, we use the first differences
of logarithms of the West German investment, income, and consumption data
and test for a possible structural change caused by the oil price shocks in
1973/74 and 1979/80. Because the first drastic price increase occurred in late
1973, we have estimated a VAR(2) model using the sample period 1960.4-1973.2
and presample values from 1960.2 and 1960.3. Thus T = 51. It is
important to note that the data from the forecast period are not used for
estimation. We have used the estimated process to compute the $\bar\tau_h$ and $\bar\lambda_h$
for h = 1, ..., 8. The results are given in Table 4.9 together with the p-values
of the tests. The p-value is the probability that the test statistic assumes a
value greater than the observed test value, if the null hypothesis is true. Thus,
p-values smaller than .10 or .05 would be of concern. Obviously, in this case
none of the test values is statistically significant at the 10% level. Thus, the
tests do not give rise to concern about the stationarity of the underlying data
generation process during the period in question. Although we have given the
$\bar\tau_h$ and $\bar\lambda_h$ values for various forecast horizons h in Table 4.9, we emphasize
that the tests are not independent for different h. Thus, the evidence from
the set of tests should not lead to overrating the confidence we may have in
this result.

Table 4.9. Stability tests for the investment/income/consumption
system for 1973-1975

quarter   forecast     $\bar\tau_h$   p-value   $\bar\lambda_h$   p-value
          horizon h
1973.3    1             .872          .46        .872             .46
1973.4    2             .271          .85        .717             .64
1974.1    3             .206          .89        .517             .85
1974.2    4             .836          .48        .627             .81
1974.3    5             .581          .63        .785             .69
1974.4    6             .172          .91        .832             .65
1975.1    7             .126          .94        .863             .63
1975.2    8            1.450          .24       1.041             .44

To check the possibility of a structural instability due to the 1979/80 oil
price increases, we used the VAR(2) model of Section 3.2.3 which is based on
data up to the fourth quarter of 1978. The resulting values of the test statistics
for h = 1, ..., 8 are presented in Table 4.10. Again none of the values is
significant at the 10% level. However, in Section 3.5.2, we found that the observed
consumption values in 1979 fall outside a 95% forecast interval. Hence,
looking at the three series individually, a possible nonstationarity would be
detected by a prediction test. This possible instability in 1979 was a reason for
using only data up to 1978 in the examples of previous chapters and sections.
The example indicates what can also be demonstrated theoretically, namely
that the power of a test based on joint forecasts of various variables may be
lower than the power of a test based on forecasts for individual variables (see
Lütkepohl (1989b)).

Table 4.10. Stability tests for the investment/income/consumption
system for 1979-1980

quarter   forecast     $\bar\tau_h$   p-value   $\bar\lambda_h$   p-value
          horizon h
1979.1    1             .277          .84        .277             .84
1979.2    2            2.003          .12       1.077             .38
1979.3    3            2.045          .12       1.464             .18
1979.4    4             .203          .89       1.245             .27
1980.1    5             .630          .60       1.339             .20
1980.2    6            1.898          .86       1.374             .17
1980.3    7             .188          .90       1.204             .28
1980.4    8             .535          .66       1.124             .34


4.7 Exercises

4.7.1 Algebraic Problems

Problem 4.1

Show that the restricted ML estimator $\tilde\beta_r$ can be written in the form (4.2.10).

Problem 4.2

Prove Lemma 4.1.

[Hint: Suppose k < n. Then

$$\begin{aligned}
c_n + a_n &= c_n + (a_n - a_{n-1}) + \dots + (a_{k+1} - a_k) + a_k \\
&> c_n + (b_n - b_{n-1}) + \dots + (b_{k+1} - b_k) + a_k \\
&\ge c_k + b_k - b_k + a_k \\
&= c_k + a_k,
\end{aligned}$$

which contradicts (4.3.13b).²]


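Problem 4.3

Show (4.4.14) and (4.4.15).
[Hint:

$$\sqrt{T}\,\operatorname{vec}\!\big(\hat U F[I_h \otimes Z'(\hat B - B)']/T\big) = \sqrt{T}\,\operatorname{vec}\!\Big[\tfrac{1}{T}\hat U F (I_h \otimes Z')(I_h \otimes (\hat B - B)')\Big] = \Big[I_{hK} \otimes \tfrac{1}{T}\hat U F (I_h \otimes Z')\Big] \sqrt{T}\,\operatorname{vec}\!\big(I_h \otimes (\hat B - B)'\big)$$

and

$$\sqrt{T}\,\operatorname{vec}\!\Big[\tfrac{1}{T}(\hat B - B) Z F [I_h \otimes Z'(\hat B - B)']\Big] = \Big[\big(I_h \otimes (\hat B - B)\big)\frac{(I_h \otimes Z) F' Z'}{T} \otimes I_K\Big] \sqrt{T}\,\operatorname{vec}(\hat B - B).]$$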


Problem 4.4

Show (4.5.14).

[Hint: Note that

$$(y_{t-1} - \bar y) \otimes (u_t - \bar u) \otimes (u_t - \bar u) = (y_{t-1} - \mu) \otimes u_t \otimes u_t - (y_{t-1} - \mu) \otimes u_t \otimes \bar u + \cdots,$$

define new variables of the type

$$z_t = (y_{t-1} - \mu) \otimes u_t \otimes u_t$$

and use that

$$\operatorname{plim} \frac{1}{T} \sum_t z_t = E(z_t) = 0.]$$

² I thank Prof. K. Schürger, Universität Bonn, for pointing out this proof.


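Problem 4.5

Using the notation and assumptions from Proposition 4.1, show that

$$\frac{\partial \ln l(\tilde\beta_r)}{\partial \beta'} \left[ -\frac{\partial^2 \ln l(\tilde\beta_r)}{\partial \beta\, \partial \beta'} \right]^{-1} \frac{\partial \ln l(\tilde\beta_r)}{\partial \beta} = (\tilde\beta_r - \tilde\beta)' \big(ZZ' \otimes (\tilde\Sigma_u^r)^{-1}\big) (\tilde\beta_r - \tilde\beta).$$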


4.7.2  Numerical  Problems 

The following problems require the use of a computer. They refer to the bivariate
series $y_t = (y_{1t}, y_{2t})'$ of first differences of the U.S. investment data in
File E2, available from the author's webpage.

Problem 4.6

Set up a sequence of tests for the correct VAR order of the data generating
process using a maximum order of M = 4. Compute the required $\chi^2$ and F
likelihood ratio statistics. Which order would you choose?

Problem 4.7

Determine VAR order estimates on the basis of the four criteria FPE, AIC,
HQ, and SC. Use a maximum VAR order of M = 4 in a first estimation round
and M = 8 in a second estimation round. Compare the results.

Problem 4.8

Compute the residual autocorrelations $\hat R_1, \dots, \hat R_{12}$ and estimate their
standard errors using the VAR(1) model obtained in Problem 3.12. Interpret your
results.

Problem 4.9

Compute LM test values $\lambda_{LM}(1)$, $\lambda_{LM}(2)$, and $\lambda_{LM}(4)$ and portmanteau test
values $Q_h$ and $\bar Q_h$ for h = 10 and 12 for the VAR(1) model of the previous
problem. Test the whiteness of the residuals.

Problem 4.10

On the basis of a VAR(1) model, perform a test for nonnormality of the
example data.

Problem 4.11

Investigate whether there was a structural change in U.S. investment after
1965 (possibly due to the increasing U.S. engagement in Vietnam).


5 


VAR  Processes  with  Parameter  Constraints 


5.1  Introduction 

In Chapter 3, we have discussed estimation of the parameters of a K-dimensional
stationary, stable VAR(p) process of the form

$$y_t = \nu + A_1 y_{t-1} + \dots + A_p y_{t-p} + u_t, \tag{5.1.1}$$

where all the symbols have their usual meanings. In the investment/income/consumption
example considered throughout Chapter 3, we found that
many of the coefficient estimates were not significantly different from zero.
This observation may be interpreted in two ways. First, some of the coefficients
may actually be zero and this fact may be reflected in the estimation
results. For instance, if some variable is not Granger-causal for the remaining
variables, zero coefficients are encountered. Second, insignificant coefficient
estimates are found if the information in the data is not rich enough to provide
sufficiently precise estimates with confidence intervals that do not contain
zero.

In the latter case, one may want to think about better ways to extract
the information from the data because, as we have seen in Chapter 3, a large
estimation uncertainty for the VAR coefficients leads to poor forecasts (large
forecast intervals) and imprecise estimates of the impulse responses and forecast
error variance components. Getting imprecise parameter estimates in a
VAR analysis is a common practical problem because the number of parameters
is often quite substantial relative to the available sample size or time series
length. Various cures for this problem have been proposed in the literature.
They all amount to putting constraints on the coefficients.

For instance, in the previous chapter, choosing the VAR order p has been
discussed. Selecting an order that is less than the maximum order amounts to
placing zero constraints on VAR coefficient matrices. This way complete coefficient
matrices are eliminated. In the present chapter, we will discuss putting
zero constraints on individual coefficients. Such constraints are but one form
of linear restrictions which will be treated in Section 5.2. Nonlinear constraints
are considered in Section 5.3 and Bayesian estimation is the subject of Section 5.4.


5.2  Linear  Constraints 

In this section, the consequences of estimating the VAR coefficients subject
to linear constraints will be considered. Different estimation procedures are
treated in Subsections 5.2.2-5.2.5; forecasting and impulse response analysis
are discussed in Subsections 5.2.6 and 5.2.7, respectively; strategies for model
selection or the choice of constraints are dealt with in Subsection 5.2.8; model
checking follows in Subsection 5.2.9; and, finally, an example is discussed in
Subsection 5.2.10.

5.2.1  The  Model  and  the  Constraints 

We consider the model (5.1.1) for t = 1, ..., T, written in compact form

$$Y = BZ + U, \tag{5.2.1}$$

where

$$Y := [y_1, \dots, y_T], \quad Z := [Z_0, \dots, Z_{T-1}] \ \text{with}\ Z_t := \begin{bmatrix} 1 \\ y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \quad B := [\nu, A_1, \dots, A_p], \quad U := [u_1, \dots, u_T].$$

Suppose that linear constraints for B are given in the form

$$\beta := \operatorname{vec}(B) = R\gamma + r, \tag{5.2.2}$$

where $\beta = \operatorname{vec}(B)$ is a (K(Kp+1) × 1) vector, R is a known (K(Kp+1) × M)
matrix of rank M, $\gamma$ is an unrestricted (M × 1) vector of unknown parameters,
and r is a K(Kp+1)-dimensional vector of known constants. All the
linear restrictions of interest can be expressed in this form. For instance, the
restriction $A_p = 0$ can be written as in (5.2.2) by choosing $M = K^2(p-1) + K$,

$$R = \begin{bmatrix} I_M \\ 0 \end{bmatrix}, \quad \gamma = \operatorname{vec}(\nu, A_1, \dots, A_{p-1}),$$

and r = 0.

Although (5.2.2) is not the most conventional form of representing linear
constraints, it is used here because it is particularly useful for our purposes.
Often the constraints are expressed as

$$C\beta = c, \tag{5.2.3}$$

where C is a known (N × ($K^2p + K$)) matrix of rank N and c is a known
(N × 1) vector (see Chapter 4, Section 4.2.2). Because rk(C) = N, the matrix
C has N linearly independent columns. For simplicity we assume that the first
N columns are linearly independent and partition C as $C = [C_1 : C_2]$, where
$C_1$ is (N × N) nonsingular and $C_2$ is (N × ($K^2p + K - N$)). Partitioning $\beta$
conformably gives

$$[C_1 : C_2] \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = C_1\beta_1 + C_2\beta_2 = c$$

or

$$\beta_1 = -C_1^{-1}C_2\beta_2 + C_1^{-1}c.$$

Therefore, choosing

$$R = \begin{bmatrix} -C_1^{-1}C_2 \\ I_{pK^2+K-N} \end{bmatrix}, \quad \gamma = \beta_2, \quad\text{and}\quad r = \begin{bmatrix} C_1^{-1}c \\ 0 \end{bmatrix},$$

the constraints (5.2.3) can be written in the form (5.2.2). Also, it is not difficult
to see that restrictions written as in (5.2.2) can be expressed in the form
$C\beta = c$ for suitable C and c. Thus, the two forms are equivalent.

The representation (5.2.2) makes it possible to impose the constraints by a simple
reparameterization of the original model. Vectorizing (5.2.1) and replacing $\beta$
by $R\gamma + r$ gives

$$\mathbf{y} := \operatorname{vec}(Y) = (Z' \otimes I_K)\operatorname{vec}(B) + \operatorname{vec}(U) = (Z' \otimes I_K)(R\gamma + r) + \mathbf{u}$$

or

$$\mathbf{z} = (Z' \otimes I_K)R\gamma + \mathbf{u}, \tag{5.2.4}$$

where $\mathbf{z} := \mathbf{y} - (Z' \otimes I_K)r$ and $\mathbf{u} := \operatorname{vec}(U)$. This form of the model allows
us to derive the estimators and their properties just like in the original
unconstrained model. Estimation of $\gamma$ and $\beta$ will be discussed in the following
subsections.
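
The passage from the implicit form $C\beta = c$ to the explicit form $\beta = R\gamma + r$ can be checked numerically. The sketch below is an illustration only; the function name `explicit_form` is hypothetical, and, as in the text, it assumes that the first N columns of C are linearly independent.

```python
import numpy as np

def explicit_form(C, c):
    """Turn C beta = c into beta = R gamma + r with gamma = beta_2.

    Assumes the leading (N x N) block C1 of C is nonsingular.
    """
    N, n = C.shape
    C1, C2 = C[:, :N], C[:, N:]
    # R = [-C1^{-1} C2 ; I_{n-N}],  r = [C1^{-1} c ; 0]
    R = np.vstack([-np.linalg.solve(C1, C2), np.eye(n - N)])
    r = np.concatenate([np.linalg.solve(C1, c), np.zeros(n - N)])
    return R, r

# Small made-up example: two restrictions on a 3-dimensional beta.
C = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
c = np.array([1.0, 2.0])
R, r = explicit_form(C, c)
gamma = np.array([0.7])            # any value of the free parameter
beta = R @ gamma + r               # satisfies C beta = c by construction
```

Since $CR = 0$ and $Cr = c$, every choice of $\gamma$ yields a $\beta$ that satisfies the original restrictions.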

5.2.2  LS,  GLS,  and  EGLS  Estimation 
Asymptotic  Properties 

Denoting by $\Sigma_u$ the covariance matrix of $u_t$, the vector $\gamma$ minimizing

$$S(\gamma) = \mathbf{u}'(I_T \otimes \Sigma_u^{-1})\mathbf{u} = [\mathbf{z} - (Z' \otimes I_K)R\gamma]'(I_T \otimes \Sigma_u^{-1})[\mathbf{z} - (Z' \otimes I_K)R\gamma] \tag{5.2.5}$$

with respect to $\gamma$ is easily seen to be

$$\begin{aligned}
\hat\gamma &= [R'(ZZ' \otimes \Sigma_u^{-1})R]^{-1} R'(Z \otimes \Sigma_u^{-1})\mathbf{z} \\
&= [R'(ZZ' \otimes \Sigma_u^{-1})R]^{-1} R'(Z \otimes \Sigma_u^{-1})[(Z' \otimes I_K)R\gamma + \mathbf{u}] \\
&= \gamma + [R'(ZZ' \otimes \Sigma_u^{-1})R]^{-1} R'(I_{Kp+1} \otimes \Sigma_u^{-1}) \operatorname{vec}(UZ')
\end{aligned} \tag{5.2.6}$$

(see Chapter 3, Section 3.2.1). This estimator is commonly called a generalized
LS (GLS) estimator because it minimizes the generalized sum of squared
errors $S(\gamma)$ rather than the sum of squared errors $\mathbf{u}'\mathbf{u}$. We will see shortly
that in contrast to the unrestricted case considered in Chapter 3, it may make
a difference here whether $S(\gamma)$ or $\mathbf{u}'\mathbf{u}$ is used as the objective function. The
GLS estimator is in general asymptotically more efficient than the multivariate
LS estimator and is therefore preferred here. We will see in Section 5.2.3
that, under Gaussian assumptions, the GLS estimator is equivalent to the
ML estimator. From (5.2.6),

$$\sqrt{T}(\hat\gamma - \gamma) = \left[ R' \left( \frac{ZZ'}{T} \otimes \Sigma_u^{-1} \right) R \right]^{-1} R'(I_{Kp+1} \otimes \Sigma_u^{-1}) \frac{1}{\sqrt{T}} \operatorname{vec}(UZ') \tag{5.2.7}$$

and the asymptotic properties of $\hat\gamma$ are obtained as in Proposition 3.1.

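Proposition 5.1 (Asymptotic Properties of the GLS Estimator)

Suppose the conditions of Proposition 3.1 are satisfied, that is, $y_t$ is a K-dimensional
stable, stationary VAR(p) process and $u_t$ is independent white
noise with bounded fourth moments. If $\beta = R\gamma + r$ as in (5.2.2) with rk(R) = M,
then $\hat\gamma$ given in (5.2.6) is a consistent estimator of $\gamma$ and

$$\sqrt{T}(\hat\gamma - \gamma) \xrightarrow{d} \mathcal{N}(0, [R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}), \tag{5.2.8}$$

where $\Gamma := E(Z_t Z_t') = \operatorname{plim} ZZ'/T$. ■

Proof: Under the conditions of the proposition, $\operatorname{plim}(ZZ'/T) = \Gamma$ and

$$\frac{1}{\sqrt{T}} \operatorname{vec}(UZ') \xrightarrow{d} \mathcal{N}(0, \Gamma \otimes \Sigma_u)$$

(see Lemma 3.1). Hence, by results stated in Appendix C, Proposition C.15(1),
using (5.2.7), $\sqrt{T}(\hat\gamma - \gamma)$ has an asymptotic normal distribution with covariance
matrix

$$[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1} R'(I \otimes \Sigma_u^{-1})(\Gamma \otimes \Sigma_u)(I \otimes \Sigma_u^{-1})R\,[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1} = [R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}.$$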

Unfortunately, the estimator $\hat\gamma$ is of limited value in practice because its
computation requires knowledge of $\Sigma_u$. Since this matrix is usually unknown,
it has to be replaced by an estimator. Using any consistent estimator $\bar\Sigma_u$
instead of $\Sigma_u$ in (5.2.6), we get an EGLS (estimated GLS) estimator

$$\hat{\hat\gamma} = [R'(ZZ' \otimes \bar\Sigma_u^{-1})R]^{-1} R'(Z \otimes \bar\Sigma_u^{-1})\mathbf{z} \tag{5.2.9}$$

which has the same asymptotic properties as the GLS estimator $\hat\gamma$. This result
is an easy consequence of the representation (5.2.7) and Proposition C.15(1)
of Appendix C.

Proposition 5.2 (Asymptotic Properties of the EGLS Estimator)

Under the conditions of Proposition 5.1, if $\operatorname{plim} \bar\Sigma_u = \Sigma_u$, the EGLS estimator
$\hat{\hat\gamma}$ in (5.2.9) is asymptotically equivalent to the GLS estimator $\hat\gamma$ in (5.2.6),
that is, $\operatorname{plim} \hat{\hat\gamma} = \gamma$ and

$$\sqrt{T}(\hat{\hat\gamma} - \gamma) \xrightarrow{d} \mathcal{N}(0, [R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}). \tag{5.2.10}$$


Once an estimator for $\gamma$ is available, an estimator for $\beta$ is obtained by
substituting in (5.2.2), that is,

$$\hat{\hat\beta} = R\hat{\hat\gamma} + r. \tag{5.2.11}$$

The asymptotic properties of this estimator follow immediately from Appendix C,
Proposition C.15(2).

Proposition 5.3 (Asymptotic Properties of the Implied Restricted EGLS Estimator)

Under the conditions of Proposition 5.2, the estimator $\hat{\hat\beta} = R\hat{\hat\gamma} + r$ is consistent
and asymptotically normally distributed,

$$\sqrt{T}(\hat{\hat\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'). \tag{5.2.12}$$
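
To make the formulas (5.2.9) and (5.2.11) concrete, here is a small NumPy sketch of a two-step EGLS computation. It is an illustration under stated assumptions rather than a reference implementation: the helper names `build_YZ`, `unrestricted_ls`, and `egls` are hypothetical, the first-step covariance estimate is taken from the unrestricted LS residuals (one possible consistent choice), and the Kronecker products are formed explicitly, which is only sensible for small K, p, and T.

```python
import numpy as np

def build_YZ(y, p):
    """y: ((T+p) x K) data incl. p presample rows -> Y (K x T), Z ((Kp+1) x T)."""
    n, K = y.shape
    T = n - p
    blocks = [np.ones((1, T))]                     # intercept row
    for j in range(1, p + 1):
        blocks.append(y[p - j:n - j].T)            # lag-j values y_{t-j}
    return y[p:].T, np.vstack(blocks)

def unrestricted_ls(Y, Z):
    """Multivariate LS estimate of B and residual covariance (divisor T-Kp-1)."""
    T = Y.shape[1]
    B = Y @ Z.T @ np.linalg.inv(Z @ Z.T)
    resid = Y - B @ Z
    return B, resid @ resid.T / (T - Z.shape[0])

def egls(Y, Z, R, r, sigma_bar):
    """EGLS estimate of gamma as in (5.2.9) and implied beta as in (5.2.11)."""
    K = Y.shape[0]
    S_inv = np.linalg.inv(sigma_bar)
    y_vec = Y.reshape(-1, order="F")               # vec(Y), columns stacked
    z = y_vec - np.kron(Z.T, np.eye(K)) @ r        # z = y - (Z' kron I_K) r
    lhs = R.T @ np.kron(Z @ Z.T, S_inv) @ R
    rhs = R.T @ np.kron(Z, S_inv) @ z
    gamma = np.linalg.solve(lhs, rhs)
    return gamma, R @ gamma + r
```

With R equal to the identity and r = 0 the result coincides with the unrestricted LS estimator, which gives a simple sanity check of the implementation.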


To make these EGLS estimators operational, we need a consistent estimator
of $\Sigma_u$. From Chapter 3, Corollary 3.2.1, we know that, under the conditions
of Proposition 5.1,

$$\hat\Sigma_u = \frac{1}{T - Kp - 1}(Y - \hat B Z)(Y - \hat B Z)' = \frac{1}{T - Kp - 1}\, Y(I_T - Z'(ZZ')^{-1}Z)Y' \tag{5.2.13}$$

is a consistent estimator of $\Sigma_u$ which may thus be used in place of $\bar\Sigma_u$. Here
$\hat B = YZ'(ZZ')^{-1}$ is the unconstrained multivariate LS estimator of the coefficient
matrix B.

Alternatively, the restricted LS estimator minimizing $\mathbf{u}'\mathbf{u}$ with respect to
$\gamma$ may be determined in a first step. The minimizing $\gamma$-vector is easily seen
to be

$$\breve\gamma = [R'(ZZ' \otimes I_K)R]^{-1} R'(Z \otimes I_K)\mathbf{z} \tag{5.2.14}$$

(see Problem 5.1). As this LS estimator does not involve the white noise
covariance matrix $\Sigma_u$, it is generally different from the GLS estimator. We
denote the corresponding $\beta$-vector by $\breve\beta$, that is, $\breve\beta = R\breve\gamma + r$. Furthermore,
$\breve B$ is the corresponding coefficient matrix, that is, $\operatorname{vec}(\breve B) = \breve\beta$. Then we may
choose

$$\breve\Sigma_u = \frac{1}{T}(Y - \breve B Z)(Y - \breve B Z)' \tag{5.2.15}$$

as an estimator for $\Sigma_u$. The consistency of this estimator is a consequence
of Proposition 3.2 and the fact that $\breve B$ is a consistent estimator of B with
asymptotic normal distribution. This result follows from the asymptotic normality
of $\breve\gamma$ which in turn follows by replacing $\Sigma_u$ with $I_K$ in (5.2.6) and
(5.2.7). Thus, $\breve\beta = R\breve\gamma + r$ is asymptotically normal. Consequently, we get the
following result from Proposition 3.2 and Corollary 3.2.1.

Proposition 5.4 (Asymptotic Properties of the White Noise Covariance Estimator)

Under the conditions of Proposition 5.1, $\breve\Sigma_u$ is consistent and
$\operatorname{plim} \sqrt{T}(\breve\Sigma_u - UU'/T) = 0$.


In (5.2.15), T may be replaced by T - Kp - 1 without affecting the consistency
of the covariance matrix estimator. However, there is little justification
for subtracting Kp + 1 from T in the present situation because, due to zero
restrictions, some or all of the K equations of the system may contain fewer
than Kp + 1 parameters.

Of course, in practice one would like to know which one of the possible
covariance estimators leads to an EGLS estimator $\hat{\hat\gamma}$ with best small sample
properties. Although we cannot give a general answer to this question, it
seems plausible to use an estimator that takes into account the nonsample
information concerning the VAR coefficients, provided the restrictions are
correct. Thus, if one is confident about the validity of the restrictions, the
covariance matrix estimator $\breve\Sigma_u$ may be used.

As an alternative to the EGLS estimator described in the foregoing, an
iterated EGLS estimator may be used. It is obtained by computing a new
covariance matrix estimator from the EGLS residuals. This estimator is then
used in place of $\bar\Sigma_u$ in (5.2.9) and again a new covariance matrix estimator
is computed from the corresponding residuals and so on. The procedure is
continued until convergence. We will not pursue it here. From Propositions
5.2 and 3.2 it follows that the asymptotic properties of the resulting iterated
EGLS estimator are the same as those of the EGLS estimator wherever the
iteration is terminated.
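
Although the iterated procedure is not pursued in the text, its mechanics are simple enough to sketch. The following NumPy fragment is one possible reading of the description above, with several assumptions made for illustration: the names `egls_step` and `iterated_egls` are hypothetical, the iteration is started from the restricted LS estimator (identity weighting), the residual covariance uses the divisor T, and convergence is judged by the largest absolute change in $\gamma$.

```python
import numpy as np

def egls_step(Y, Z, R, r, sigma_u):
    """One EGLS solve for gamma with a given covariance matrix, cf. (5.2.9)."""
    K = Y.shape[0]
    S_inv = np.linalg.inv(sigma_u)
    z = Y.reshape(-1, order="F") - np.kron(Z.T, np.eye(K)) @ r
    return np.linalg.solve(R.T @ np.kron(Z @ Z.T, S_inv) @ R,
                           R.T @ np.kron(Z, S_inv) @ z)

def iterated_egls(Y, Z, R, r, tol=1e-8, max_iter=200):
    """Alternate between residual covariance and EGLS until gamma settles."""
    K, T = Y.shape
    gamma = egls_step(Y, Z, R, r, np.eye(K))        # restricted LS start
    sigma_u = np.eye(K)
    for it in range(1, max_iter + 1):
        B = (R @ gamma + r).reshape(K, -1, order="F")   # undo vec(B)
        resid = Y - B @ Z
        sigma_u = resid @ resid.T / T               # covariance from residuals
        new_gamma = egls_step(Y, Z, R, r, sigma_u)
        converged = np.max(np.abs(new_gamma - gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, sigma_u, it
```

The function returns the final $\gamma$, the last covariance estimate used, and the number of iterations performed; hitting `max_iter` would indicate that the tolerance was not reached.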


Comparison  of  LS  and  Restricted  EGLS  Estimators 


A question of interest in this context is how the covariance matrix in (5.2.12)
compares with the asymptotic covariance matrix $\Gamma^{-1} \otimes \Sigma_u$ of the unrestricted
multivariate LS estimator $\hat\beta$. To see that the restricted estimator has smaller
or at least not greater asymptotic variances than the unrestricted estimator,
it is helpful to write the restrictions in the form (5.2.3). In that case, the
restricted EGLS estimator of $\beta$ turns out to be

$$\hat{\hat\beta} = \hat\beta + [(ZZ')^{-1} \otimes \bar\Sigma_u]C'[C((ZZ')^{-1} \otimes \bar\Sigma_u)C']^{-1}(c - C\hat\beta) \tag{5.2.16}$$

(see Chapter 4, Section 4.2.2, and Problem 5.2). Noting that $C\beta - c = 0$,
subtracting $\beta$ from both sides of (5.2.16), and multiplying by $\sqrt{T}$ gives

$$\sqrt{T}(\hat{\hat\beta} - \beta) = \sqrt{T}(\hat\beta - \beta) - F_T\sqrt{T}(\hat\beta - \beta) = (I_{K^2p+K} - F_T)\sqrt{T}(\hat\beta - \beta),$$

where

$$F_T := \left[ \left( \frac{ZZ'}{T} \right)^{-1} \otimes \bar\Sigma_u \right] C' \left[ C \left( \left( \frac{ZZ'}{T} \right)^{-1} \otimes \bar\Sigma_u \right) C' \right]^{-1} C,$$

so that

$$F := \operatorname{plim} F_T = (\Gamma^{-1} \otimes \Sigma_u)C'[C(\Gamma^{-1} \otimes \Sigma_u)C']^{-1}C.$$

Thus, the covariance matrix of the asymptotic distribution of $\sqrt{T}(\hat{\hat\beta} - \beta)$ is

$$\begin{aligned}
(I - F)(\Gamma^{-1} \otimes \Sigma_u)(I - F)' &= \Gamma^{-1} \otimes \Sigma_u - (\Gamma^{-1} \otimes \Sigma_u)F' - F(\Gamma^{-1} \otimes \Sigma_u) + F(\Gamma^{-1} \otimes \Sigma_u)F' \\
&= \Gamma^{-1} \otimes \Sigma_u - (\Gamma^{-1} \otimes \Sigma_u)C'[C(\Gamma^{-1} \otimes \Sigma_u)C']^{-1}C(\Gamma^{-1} \otimes \Sigma_u).
\end{aligned}$$

In other words, a positive semidefinite matrix is subtracted from the covariance
matrix $\Gamma^{-1} \otimes \Sigma_u$ to obtain the asymptotic covariance matrix of the
restricted estimator. Hence, the asymptotic variances of the latter will be
smaller than or at most equal to those of the unrestricted multivariate LS
estimator. Because the two ways of writing the restrictions in (5.2.3) and (5.2.2)
are equivalent, the EGLS estimator of $\beta$ subject to restrictions $\beta = R\gamma + r$
must also be asymptotically superior to the unconstrained estimator. In other
words,

$$\Gamma^{-1} \otimes \Sigma_u - R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'$$

is positive semidefinite. This result shows that imposing restrictions is advantageous
in terms of asymptotic efficiency. It must be kept in mind, however,
that the restrictions are assumed to be valid in the foregoing derivations. In
practice, there is usually some uncertainty with respect to the validity of the
constraints.
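
The positive semidefiniteness claimed above can be checked numerically for any particular choice of $\Gamma$, $\Sigma_u$, and R. The sketch below does this with NumPy; the function name `efficiency_gap` is hypothetical, and the matrices in the example are arbitrary symmetric positive definite stand-ins rather than quantities estimated from data.

```python
import numpy as np

def efficiency_gap(Gamma, Sigma_u, R):
    """Unrestricted minus restricted asymptotic covariance matrix of beta."""
    V_unres = np.kron(np.linalg.inv(Gamma), Sigma_u)
    W = np.kron(Gamma, np.linalg.inv(Sigma_u))
    V_res = R @ np.linalg.inv(R.T @ W @ R) @ R.T
    return V_unres - V_res

def random_spd(m, rng):
    """Arbitrary symmetric positive definite (m x m) matrix."""
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)

rng = np.random.default_rng(0)
K, p = 2, 1
Gamma = random_spd(K * p + 1, rng)       # stand-in for plim ZZ'/T
Sigma_u = random_spd(K, rng)
# Restrict the last two elements of beta = vec(B) to zero (last column of B).
R = np.vstack([np.eye(4), np.zeros((2, 4))])
gap = efficiency_gap(Gamma, Sigma_u, R)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
```

Up to rounding error, `min_eig` should be nonnegative, in line with the result that valid restrictions cannot increase the asymptotic variances.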


5.2.3  Maximum  Likelihood  Estimation 

So far in this chapter, no specific distribution of the process $y_t$ is assumed. If
the precise distribution of the process is known, ML estimation of the VAR
coefficients is possible. In the following, we assume that $y_t$ is Gaussian (normally
distributed). The ML estimators of $\gamma$ and $\Sigma_u$ are found by equating
to zero the first order partial derivatives of the log-likelihood function and
solving for $\gamma$ and $\Sigma_u$. The partial derivatives are found as in Section 3.4 of
Chapter 3. Note that

$$\frac{\partial \ln l}{\partial \gamma} = \frac{\partial \beta'}{\partial \gamma} \frac{\partial \ln l}{\partial \beta} = R' \frac{\partial \ln l}{\partial \beta},$$

by the chain rule for vector differentiation (Appendix A.13). Proceeding as in
Section 3.4, the ML estimator of $\gamma$ is seen to be

$$\tilde\gamma = [R'(ZZ' \otimes \tilde\Sigma_u^{-1})R]^{-1} R'(Z \otimes \tilde\Sigma_u^{-1})\mathbf{z}, \tag{5.2.17}$$

where $\tilde\Sigma_u$ is the ML estimator of $\Sigma_u$ (see Problem 5.3). The resulting ML
estimator of $\beta$ is

$$\tilde\beta = R\tilde\gamma + r. \tag{5.2.18}$$

Furthermore, the ML estimator of $\Sigma_u$ is seen to be

$$\tilde\Sigma_u = \frac{1}{T}(Y - \tilde B Z)(Y - \tilde B Z)', \tag{5.2.19}$$

where $\tilde B$ is the (K × (Kp + 1)) matrix satisfying $\operatorname{vec}(\tilde B) = \tilde\beta$.

An immediate consequence of the consistency of the ML estimator $\tilde\Sigma_u$ and
of Proposition 5.2 is that the EGLS estimator $\hat{\hat\gamma}$ and the ML estimator $\tilde\gamma$ are
asymptotically equivalent. In addition, it follows as in Section 3.2.2, Chapter
3, that $\tilde\Sigma_u$ has the same asymptotic properties as in the unrestricted case (see
Proposition 3.2) and $\tilde\beta$ and $\tilde\Sigma_u$ are asymptotically independent. In summary,
we get the following result.

Proposition 5.5 (Asymptotic Properties of the Restricted ML Estimators)

Let $y_t$ be a Gaussian stable K-dimensional VAR(p) process as in (5.1.1) and
$\beta = \operatorname{vec}(B) = R\gamma + r$ as in (5.2.2). Then the ML estimators $\tilde\beta$ and
$\tilde\sigma = \operatorname{vech}(\tilde\Sigma_u)$ are consistent and asymptotically normally distributed,

$$\sqrt{T} \begin{bmatrix} \tilde\beta - \beta \\ \tilde\sigma - \sigma \end{bmatrix} \xrightarrow{d} \mathcal{N}\left( 0, \begin{bmatrix} R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R' & 0 \\ 0 & 2D_K^+(\Sigma_u \otimes \Sigma_u)D_K^{+\prime} \end{bmatrix} \right),$$

where $D_K^+ = (D_K'D_K)^{-1}D_K'$ is, as usual, the Moore-Penrose inverse of the
($K^2$ × K(K + 1)/2) duplication matrix $D_K$. ■

Of course, we could have stated the proposition in terms of the joint distribution
of $\tilde\gamma$ and $\tilde\sigma$ instead. In the following, the distribution given in the
proposition will turn out to be more useful, though.

Both EGLS and ML estimation can be discussed in terms of the mean-adjusted
model considered in Section 3.3. However, the present discussion
includes restrictions for the intercept terms in a convenient way. If the restrictions
are equivalent in the different versions of the model, the asymptotic
properties of the estimators of $\alpha := \operatorname{vec}(A_1, \dots, A_p)$ will not be affected.
For instance, the asymptotic covariance matrix of $\sqrt{T}(\tilde\alpha - \alpha)$, where
$\tilde\alpha$ is the ML estimator, is just the lower right-hand ($K^2p$ × $K^2p$) block of
$R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'$ from Proposition 5.5. If the sample means are subtracted
from all variables and the constraints are given in the form $\alpha = R\gamma + r$
for a suitable matrix R and vectors $\gamma$ and r, the covariance matrix of the
asymptotic distribution of $\sqrt{T}(\tilde\alpha - \alpha)$ can be written as

$$R[R'(\Gamma_Y(0) \otimes \Sigma_u^{-1})R]^{-1}R', \tag{5.2.20}$$

where $\Gamma_Y(0) := \Sigma_Y = \operatorname{Cov}(Y_t)$ with $Y_t := (y_t', \dots, y_{t-p+1}')'$.


5.2.4  Constraints  for  Individual  Equations 

In practice, parameter restrictions are often formulated for the $K$ equations
of the system (5.1.1) separately. In that case, it may be easier to write the
restrictions in terms of the vector $b := \mathrm{vec}(B')$, which contains the parameters
of the first equation in the first $Kp+1$ positions, those of the second
equation in the second $Kp+1$ positions, etc. If the constraints are expressed
as

$$b = \bar R c + \bar r, \tag{5.2.21}$$

where $\bar R$ is a known $((K^2p + K) \times M)$ matrix of rank $M$, $c$ is an unknown
$(M \times 1)$ parameter vector, and $\bar r$ is a known $(K^2p + K)$-dimensional vector,
the restricted EGLS and ML estimators of $b$ and their properties are easily
derived. We get the following proposition:

Proposition 5.6 (EGLS Estimator of Parameters Arranged Equationwise)
Under the conditions of Proposition 5.2, if $b = \mathrm{vec}(B')$ satisfies (5.2.21), the
EGLS estimator of $c$ is

$$\hat{\hat c} = [\bar R'(\bar\Sigma_u^{-1} \otimes ZZ')\bar R]^{-1}\bar R'(\bar\Sigma_u^{-1} \otimes Z)[\mathrm{vec}(Y') - (I_K \otimes Z')\bar r], \tag{5.2.22}$$

where $\bar\Sigma_u$ is a consistent estimator of $\Sigma_u$. The corresponding estimator of $b$
is

$$\hat{\hat b} = \bar R\hat{\hat c} + \bar r, \tag{5.2.23}$$

which is consistent and asymptotically normally distributed,

$$\sqrt{T}(\hat{\hat b} - b) \xrightarrow{d} \mathcal{N}\left(0, \bar R[\bar R'(\Sigma_u^{-1} \otimes \Gamma)\bar R]^{-1}\bar R'\right). \tag{5.2.24}$$

■

The proof is left as an exercise (see Problem 5.4). An estimator of $\beta$ is
obtained from $\hat{\hat b}$ by premultiplying with the commutation matrix $\mathbf{K}_{Kp+1,K}$.
If the restrictions in (5.2.21) are equivalent to those in (5.2.2), the estimator
for $\beta$ obtained in this way is identical to $\hat{\hat\beta}$ given in (5.2.11).
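
The estimator in (5.2.22) can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `egls_equationwise`, the simulated bivariate VAR(1) data, and the particular zero restriction are hypothetical and not taken from the text, and $\mathrm{vec}(Y')$ is paired with $I_K \otimes Z'$, which is the ordering implied by $b = \mathrm{vec}(B')$.

```python
import numpy as np

def egls_equationwise(Y, Z, R_bar, r_bar, sigma_u):
    """EGLS estimates of c and b in b = R_bar c + r_bar, where b = vec(B').

    Y: (K x T) observations, Z: ((Kp+1) x T) regressors,
    sigma_u: (K x K) consistent estimate of the white noise covariance.
    """
    K = Y.shape[0]
    s_inv = np.linalg.inv(sigma_u)
    y_vec = Y.reshape(-1)                   # vec(Y'): equation after equation
    x_mat = np.kron(np.eye(K), Z.T)         # I_K (x) Z'
    lhs = R_bar.T @ np.kron(s_inv, Z @ Z.T) @ R_bar
    rhs = R_bar.T @ np.kron(s_inv, Z) @ (y_vec - x_mat @ r_bar)
    c_hat = np.linalg.solve(lhs, rhs)
    return c_hat, R_bar @ c_hat + r_bar

# Hypothetical bivariate VAR(1) sample (assumption, for illustration only).
rng = np.random.default_rng(0)
K, T = 2, 200
A1 = np.array([[0.5, 0.0], [0.2, 0.3]])
y = np.zeros((T + 1, K))
for t in range(1, T + 1):
    y[t] = A1 @ y[t - 1] + rng.standard_normal(K)
Y = y[1:].T
Z = np.vstack([np.ones(T), y[:-1].T])       # rows: 1, y_{1,t-1}, y_{2,t-1}

B_ols = Y @ Z.T @ np.linalg.inv(Z @ Z.T)    # unrestricted LS
U = Y - B_ols @ Z
sigma_u = U @ U.T / T

# b = (nu_1, a_11, a_12, nu_2, a_21, a_22)'; restrict a_12 = 0.
R_bar = np.delete(np.eye(6), 2, axis=1)
c_hat, b_hat = egls_equationwise(Y, Z, R_bar, np.zeros(6), sigma_u)

# Without restrictions, EGLS reduces to equationwise LS.
_, b_unres = egls_equationwise(Y, Z, np.eye(6), np.zeros(6), sigma_u)
```

With the identity as restriction matrix the result coincides with LS applied to each equation, which gives a quick sanity check on the Kronecker ordering.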

5.2.5  Restrictions  for  the  White  Noise  Covariance  Matrix 

Occasionally restrictions for the white noise covariance matrix $\Sigma_u$ are
available. For instance, in Chapter 2, Section 2.3.1, we have seen that instantaneous
noncausality is equivalent to $\Sigma_u$ being block-diagonal. Thus, in that case there
are zero off-diagonal elements. Zero constraints are, in fact, the most common
constraints for the off-diagonal elements of $\Sigma_u$. Therefore, we will focus on
such restrictions in the following.

Estimation under zero restrictions for $\Sigma_u$ is often most easily performed in
the context of the recursive model introduced in Chapter 2, Section 2.3.2. In
order to obtain the recursive form corresponding to the standard VAR model

$$y_t = \nu + A_1y_{t-1} + \cdots + A_py_{t-p} + u_t,$$

$\Sigma_u$ is decomposed as $\Sigma_u = W\Sigma_\varepsilon W'$, where $W$ is lower triangular with unit
main diagonal and $\Sigma_\varepsilon$ is a diagonal matrix. Then, premultiplying with $W^{-1}$
gives the recursive system

$$y_t = \nu^* + A_0^*y_t + A_1^*y_{t-1} + \cdots + A_p^*y_{t-p} + \varepsilon_t,$$

where $\nu^* := W^{-1}\nu$, $A_0^* := I_K - W^{-1}$ is a lower triangular matrix with zero
diagonal, $A_i^* := W^{-1}A_i$, $i = 1, \dots, p$, and $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{Kt})' := W^{-1}u_t$ has
diagonal covariance matrix, $\Sigma_\varepsilon := E(\varepsilon_t\varepsilon_t')$. The characteristic feature of the
recursive representation of our process is that the $k$-th equation may involve
$y_{1,t}, \dots, y_{k-1,t}$ (current values of $y_1, \dots, y_{k-1}$) on the right-hand side and the
components of the white noise process $\varepsilon_t$ are uncorrelated.

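Many zero restrictions for the off-diagonal elements of $\Sigma_u$ are equivalent
to simple zero restrictions on $A_0^*$ which are easy to impose in equationwise LS
estimation. For instance, if $\Sigma_u$ is block-diagonal, say

$$\Sigma_u = \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix},$$

then $\Sigma_{11}$ and $\Sigma_{22}$ can be decomposed in the form

$$\Sigma_{ii} = W_i\Sigma_{\varepsilon i}W_i', \quad i = 1, 2,$$

where $W_i$ is lower triangular with unit diagonal and $\Sigma_{\varepsilon i}$ is a diagonal matrix.
Hence,

$$A_0^* = I_K - \begin{bmatrix} W_1^{-1} & 0 \\ 0 & W_2^{-1} \end{bmatrix}
= \begin{bmatrix} A_{01}^* & 0 \\ 0 & A_{02}^* \end{bmatrix},$$

where the

$$A_{0i}^* = \begin{bmatrix} 0 & \cdots & \cdots & 0 \\ * & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ * & \cdots & * & 0 \end{bmatrix}, \quad i = 1, 2,$$

are lower triangular with zero main diagonal. In summary, if $\Sigma_u$ is
block-diagonal with an $(m \times n)$ block of zeros in its lower left-hand corner, the
same holds for $A_0^*$.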

Because the error terms of the $K$ equations of the recursive system are
uncorrelated, it can be shown that estimating each equation separately does not
result in a loss of asymptotic efficiency (see Problem 5.6). Using the notation

$$y_{(k)} := \begin{bmatrix} y_{k1} \\ \vdots \\ y_{kT} \end{bmatrix}, \qquad
\varepsilon_{(k)} := \begin{bmatrix} \varepsilon_{k1} \\ \vdots \\ \varepsilon_{kT} \end{bmatrix},$$

and denoting by $b_{(k)}$ the vector of all nonzero coefficients and by $Z_{(k)}$ the
corresponding matrix of regressors in the $k$-th equation of the recursive form
of the system, we may write the $k$-th equation as

$$y_{(k)} = Z_{(k)}b_{(k)} + \varepsilon_{(k)}.$$

The LS estimator of $b_{(k)}$ is

$$\hat b_{(k)} = (Z_{(k)}'Z_{(k)})^{-1}Z_{(k)}'y_{(k)}.$$

Under Gaussian assumptions, it is equivalent to the ML estimator and is thus
asymptotically efficient. Obviously, this framework makes it easy to take into
account zero restrictions by just eliminating regressors.

Generally, restrictions on $\Sigma_u$ imply restrictions for $A_0^*$ and vice versa.
Unfortunately, zero restrictions on $\Sigma_u$ do not always imply zero restrictions
for $A_0^*$. Consider, for instance, the covariance matrix

Thus, although $\Sigma_u$ has a zero off-diagonal element, all elements of $A_0^*$ below
the main diagonal are nonzero.
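
The relation between zeros in $\Sigma_u$ and zeros in $A_0^*$ can be checked numerically. The sketch below is an illustration only: the helper `recursive_form` and both covariance matrices are hypothetical choices made here, not examples from the text. It obtains $W$ and $\Sigma_\varepsilon$ from a Cholesky factor and then forms $A_0^* = I_K - W^{-1}$.

```python
import numpy as np

def recursive_form(sigma_u):
    """Return W, Sigma_eps and A0* = I - W^{-1} with Sigma_u = W Sigma_eps W'."""
    P = np.linalg.cholesky(sigma_u)   # Sigma_u = P P', P lower triangular
    d = np.diag(P)
    W = P / d                         # scale column j by 1/d_j: unit diagonal
    sigma_eps = np.diag(d ** 2)
    A0 = np.eye(len(d)) - np.linalg.inv(W)
    return W, sigma_eps, A0

# Hypothetical block-diagonal case: the zero block carries over to A0*.
sigma_block = np.array([[1.0, 0.5, 0.0],
                        [0.5, 1.0, 0.0],
                        [0.0, 0.0, 2.0]])
_, _, A0_block = recursive_form(sigma_block)

# Hypothetical case with a single zero in position (3,1) of Sigma_u:
# every element of A0* below the main diagonal is nonzero.
sigma_single = np.array([[1.0, 0.5, 0.0],
                         [0.5, 1.0, 0.5],
                         [0.0, 0.5, 1.0]])
W, sigma_eps, A0_single = recursive_form(sigma_single)
```

For the second matrix the third row of $A_0^*$ is $(-1/3, 2/3, 0)$, so the current values of both $y_1$ and $y_2$ enter the third equation even though $u_{1t}$ and $u_{3t}$ are uncorrelated.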

In practice, subject matter theory is often more likely to provide
restrictions for the $A_0^*$ matrix than for $\Sigma_u$ because, as we have seen in Section
2.3.2, the elements of $A_0^*$ can sometimes be interpreted as impact multipliers
which represent the instantaneous effects of impulses in the variables. For this
reason, the recursive form of the system has considerable appeal.

Note, however, that if restrictions are available on the coefficients $A_i$ of
the standard VAR form, the implied constraints for the $A_i^*$ matrices should be
taken into account in the estimation. Such restrictions may be cross-equation
restrictions that involve coefficients from different equations. Taking them
into account may require simultaneous estimation of all or some equations
of the system rather than single equation LS estimation. In the following
sections, we return to the standard form of the VAR model. Further discussion
of covariance restrictions will be provided in the context of structural VAR
models in Chapter 9.


5.2.6  Forecasting 

Forecasting with estimated processes was discussed in Section 3.5 of Chapter
3. The general results of that section remain valid even if parameter
restrictions are imposed in the estimation procedure. Some differences in details
will be pointed out in the following.

We focus on the standard form (5.1.1) of the VAR model and denote the
parameter estimators by $\hat\nu, \hat A_1, \dots, \hat A_p$, and $\hat\beta$. These estimators may be EGLS
or ML estimators. The resulting $h$-step forecast at origin $t$ is

$$\hat y_t(h) = \hat\nu + \hat A_1\hat y_t(h-1) + \cdots + \hat A_p\hat y_t(h-p) \tag{5.2.25}$$

with $\hat y_t(j) := y_{t+j}$ for $j \le 0$, as in Section 3.5. In line with that section, we
assume that forecasting and parameter estimation are based on independent
processes with identical stochastic structure. Then we get the approximate
MSE matrix

$$\Sigma_{\hat y}(h) = \Sigma_y(h) + \frac{1}{T}\Omega(h), \tag{5.2.26}$$

where

$$\Sigma_y(h) := E\{[y_{t+h} - y_t(h)][y_{t+h} - y_t(h)]'\} = \sum_{i=0}^{h-1}\Phi_i\Sigma_u\Phi_i',$$

$\Phi_i$ being, as usual, the $i$-th coefficient matrix of the canonical MA
representation of $y_t$, and

$$\Omega(h) := E\left[\frac{\partial y_t(h)}{\partial\beta'}\,\Sigma_{\hat\beta}\,\frac{\partial y_t(h)'}{\partial\beta}\right],$$

where $\Sigma_{\hat\beta}$ is the covariance matrix of the asymptotic distribution of $\sqrt{T}(\hat\beta - \beta)$.

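In Chapter 3, the matrix $\Omega(h)$ has a particularly simple form because
in that chapter $\Sigma_{\hat\beta} = \Gamma^{-1} \otimes \Sigma_u$. In the present situation, where parameter
restrictions are imposed,

$$\Sigma_{\hat\beta} = R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R',$$

and the form of $\Omega(h)$ is not quite so simple. Because the covariance matrix $\Sigma_{\hat\beta}$
is now smaller than in Chapter 3, $\Omega(h)$ will also become smaller (not greater).
Using

$$\frac{\partial y_t(h)}{\partial\beta'} = \sum_{i=0}^{h-1} Z_t'(\mathbf{B}')^{h-1-i} \otimes \Phi_i$$

from Chapter 3, (3.5.11), we may now estimate $\Omega(h)$ by

$$\hat\Omega(h) = \frac{1}{T}\sum_{t=1}^{T}\left[\sum_{i=0}^{h-1} Z_t'(\mathbf{B}')^{h-1-i} \otimes \Phi_i\right]\Sigma_{\hat\beta}\left[\sum_{i=0}^{h-1} Z_t'(\mathbf{B}')^{h-1-i} \otimes \Phi_i\right]'. \tag{5.2.27}$$

Here

$$\mathbf{B} := \begin{bmatrix} 1 & 0 \;\cdots\; 0 \\ \multicolumn{2}{c}{B} \\ 0 & I_{K(p-1)} \;\; 0 \end{bmatrix} \qquad ((Kp+1) \times (Kp+1))$$

(see Section 3.5.2). In practice, the unknown matrices $\mathbf{B}$, $\Phi_i$, and $\Sigma_{\hat\beta}$ are
replaced by consistent estimators. Of course, if $T$ is large we may simply
ignore the term $\Omega(h)/T$ in (5.2.26) because it approaches zero as $T \to \infty$.
An estimator of $\Sigma_{\hat y}(h)$ is then obtained by simply replacing the unknown
quantities in $\Sigma_y(h)$ by estimators. Assuming that $y_t$ is Gaussian, forecast
intervals and regions can be determined exactly as in Section 3.5.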
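
For large $T$, where the $\Omega(h)/T$ term is ignored, the forecast recursion (5.2.25) and the MSE matrix $\Sigma_y(h)$ can be sketched as follows. The function names and the numerical VAR(1) coefficients are assumptions made for illustration; in practice the restricted estimates would be plugged in for $\nu$, $A_1, \dots, A_p$, and $\Sigma_u$.

```python
import numpy as np

def ma_coefficients(A, h):
    """Phi_0, ..., Phi_{h-1} from the VAR coefficient matrices A = [A_1, ..., A_p]."""
    K = A[0].shape[0]
    phi = [np.eye(K)]
    for i in range(1, h):
        # Phi_i = sum_{j=1}^{min(i,p)} Phi_{i-j} A_j
        phi.append(sum(phi[i - j] @ A[j - 1] for j in range(1, min(i, len(A)) + 1)))
    return phi

def forecast_mse(A, sigma_u, h):
    """Sigma_y(h) = sum_{i=0}^{h-1} Phi_i Sigma_u Phi_i'."""
    return sum(P @ sigma_u @ P.T for P in ma_coefficients(A, h))

def forecast(nu, A, y_hist, h):
    """Forecasts for horizons 1..h; the rows of y_hist are y_{t-p+1}, ..., y_t."""
    path = [row for row in y_hist]
    for _ in range(h):
        path.append(nu + sum(A[j] @ path[-1 - j] for j in range(len(A))))
    return np.array(path[-h:])

# Hypothetical VAR(1) parameter values (assumption, for illustration only).
A = [np.array([[0.5, 0.1], [0.4, 0.5]])]
nu = np.array([1.0, 0.5])
sigma_u = np.eye(2)

mse2 = forecast_mse(A, sigma_u, 2)                       # Sigma_u + A_1 Sigma_u A_1'
path = forecast(nu, A, np.array([[2.0, 1.0]]), 2)        # 1- and 2-step forecasts
```

Forecast intervals under Gaussian assumptions would then use the square roots of the diagonal elements of the MSE matrix, as in Section 3.5.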


5.2.7  Impulse  Response  Analysis  and  Forecast  Error  Variance 
Decomposition 

Impulse response analysis and forecast error variance decomposition with
restricted VAR models can be done as described in Section 3.7. Proposition 3.6
is formulated in sufficiently general form to accommodate the case of restricted
estimation. The impulse responses are then estimated from the restricted
estimators of $A_1, \dots, A_p$. As mentioned earlier, the covariance matrix of the
restricted estimator of $\alpha := \mathrm{vec}(A_1, \dots, A_p)$ is obtained by considering the
lower right-hand $(K^2p \times K^2p)$ block of

$$\Sigma_{\hat\beta} = R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'.$$

As we have seen in Subsection 5.2.3, Proposition 5.5, the asymptotic
covariance matrix $\Sigma_{\tilde\sigma}$ of $\sqrt{T}(\tilde\sigma - \sigma)$ is not affected by the restrictions for $\beta$.
However, the estimator of

$$\Sigma_{\tilde\sigma} = 2D_K^{+}(\Sigma_u \otimes \Sigma_u)D_K^{+\prime}$$

may be affected. As discussed in Section 5.2.2, we have the choice of different
consistent estimators for $\Sigma_u$ which may or may not take into account the
parameter constraints. In other words, we may estimate $\Sigma_u$ from the residuals
of an unrestricted estimation or we may use the residuals of the restricted LS
or EGLS estimation. The lower triangular matrix $P$ that is used in estimating
the impulse responses for orthogonal innovations is estimated accordingly. In
the examples considered below, we will usually base the estimators of $\Sigma_u$
and $P$ on the residuals of the restricted EGLS estimation. In contrast, $\Gamma$ will
usually be estimated by $ZZ'/T$, as in the unrestricted case. Of course, instead
of the intercept version of the process we may use the mean-adjusted form for
estimation, as mentioned in Section 5.2.3.

5.2.8  Specification  of  Subset  VAR  Models 

A VAR model with zero constraints on the coefficients is called a subset VAR
model. Formally, zero restrictions can be written as in (5.2.2) or (5.2.21) with
$r = \bar r = 0$. We have encountered such models in previous chapters. For
instance, when Granger-causality restrictions are imposed, we get subset VAR
models. This example suggests two ways of obtaining such restrictions,
namely, from prior nonsample information and/or from tests of particular
hypotheses. Subject matter theory sometimes implies a set of restrictions on a
VAR model that can be taken into account, using the estimation procedures
outlined in the foregoing. However, in many cases generally accepted a priori
restrictions are not available. In that situation, statistical procedures may be
used to detect or confirm possible zero constraints. In the following, we will
discuss such procedures.

If little or no a priori knowledge of possible zero constraints is available,
one may want to compare various different processes or models and choose the
one which is optimal under a specific criterion. Using hypothesis tests in such
a situation may create problems because the different possible models may not
be nested. In that case, statistical tests may not lead to a unique answer as to
which model to use. Therefore, in subset VAR modelling it is not uncommon to
base the model choice on model selection criteria. For instance, appropriately
modified versions of AIC, SC, or HQ may be employed. Generally speaking,
in such an approach the subset VAR model is chosen that optimizes some
prespecified criterion.

Suppose it is just known that the order of the process is not greater than
some number $p$ and otherwise no prior knowledge of possible zero constraints
is available. In that situation, one would ideally fit all possible subset VAR
models and select the one that optimizes the criterion chosen. The
practicability of such a procedure is limited by its computational expense. Note that
for a $K$-dimensional VAR($p$) process, even if we do not take into account the
intercept terms for the moment, there exist $K^2p$ coefficients from which

$$\binom{K^2p}{j}$$

subsets with $j$ elements can be chosen. Thus, there is a total of

$$\sum_{j=0}^{K^2p-1}\binom{K^2p}{j} = 2^{K^2p} - 1$$

subset VAR models, not counting the full VAR($p$) model which is also a
possible candidate. For instance, for a bivariate VAR(4) process, there are as many
as $2^{16} - 1 = 65{,}535$ subset models plus the full VAR(4) model. Of course, in
practice the dimension and order of the process will often be greater than in
this example and there may be many more subset VAR models. Therefore,
specific strategies for subset VAR modelling have been proposed which avoid
fitting all potential candidates. Some possibilities will be described briefly in
the following.
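
The count for the bivariate VAR(4) example can be reproduced directly. The short sketch below uses a hypothetical helper name, `n_subset_models`, and sums the binomial coefficients over all subset sizes smaller than $K^2p$.

```python
from math import comb

def n_subset_models(K, p):
    """Number of subset VAR models, not counting the full VAR(p) model."""
    n_coef = K * K * p                     # intercepts are not counted here
    return sum(comb(n_coef, j) for j in range(n_coef))

count_bivariate_var4 = n_subset_models(2, 4)
count_trivariate_var4 = n_subset_models(3, 4)
```

Moving from $K = 2$ to $K = 3$ with $p = 4$ raises the exponent from 16 to 36, which illustrates why an exhaustive search quickly becomes impractical.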


Elimination  of  Complete  Matrices 

Penm & Terrell (1982) considered subset models where complete coefficient
matrices $A_j$ rather than individual coefficients are set to zero. Such a strategy
reduces the models to be compared to

$$\sum_{j=0}^{p}\binom{p}{j} = 2^p.$$

For instance, for a VAR(4) process, only 16 models need to be compared.

An obvious advantage of the procedure is its relatively small
computational expense. Deleting complete coefficient matrices may be reasonable if
seasonal data with strong seasonal components are considered for which only
coefficients at seasonal lags are different from zero. On the other hand, there
may still be potential for further parameter restrictions. Moreover, some of the
deleted coefficient matrices may contain elements that would not have been
deleted had they been checked individually. Therefore, the following strategy
may be more useful.
Top-Down Strategy

The top-down strategy starts from the full VAR($p$) model and coefficients are
deleted in the $K$ equations separately. The $k$-th equation may be written as

$$y_{kt} = \nu_k + \alpha_{k1,1}y_{1,t-1} + \cdots + \alpha_{kK,1}y_{K,t-1} + \cdots
+ \alpha_{k1,p}y_{1,t-p} + \cdots + \alpha_{kK,p}y_{K,t-p} + u_{kt}. \tag{5.2.28}$$

The goal is to find the zero restrictions for the coefficients of this equation
that lead to the minimum value of a prespecified criterion. For this purpose,
the equation is estimated by LS and the corresponding value of the criterion is
evaluated. Then the last coefficient $\alpha_{kK,p}$ is set to zero (i.e., $y_{K,t-p}$ is deleted
from the equation) and the equation is estimated again with this restriction.
If the value of the criterion for the restricted model is greater than for the
unrestricted model, $y_{K,t-p}$ is kept in the equation. Otherwise it is eliminated.
Then the same procedure is repeated for the second last coefficient, $\alpha_{k,K-1,p}$,
or variable $y_{K-1,t-p}$, and so on up to $\nu_k$. In each step a lag of a variable is
deleted if the criterion does not increase by that additional constraint
compared to the smallest value obtained in the previous steps.

Criteria that may be used in this procedure are

$$\mathrm{AIC} = \ln\tilde\sigma^2 + \frac{2}{T}\,(\text{number of estimated parameters}), \tag{5.2.29}$$

$$\mathrm{HQ} = \ln\tilde\sigma^2 + \frac{2\ln\ln T}{T}\,(\text{number of estimated parameters}), \tag{5.2.30}$$

or

$$\mathrm{SC} = \ln\tilde\sigma^2 + \frac{\ln T}{T}\,(\text{number of estimated parameters}). \tag{5.2.31}$$

Here $\tilde\sigma^2$ stands for the sum of squared estimation residuals divided by the
sample size $T$. For instance, the AIC value for a model with or without zero
restrictions is computed by estimating the $k$-th equation, computing the
residual sum of squares and dividing by $T$ to obtain $\tilde\sigma^2$. Then two times the number
of parameters contained in the estimated equation is divided by $T$ and added
to the natural logarithm of $\tilde\sigma^2$. In the final equation, only those variables and
coefficients are retained that lead to the minimum AIC value.

In a more formal manner, this procedure can be described as follows. The
$k$-th equation of the system may be written as

$$y_{(k)} = \begin{bmatrix} y_{k1} \\ \vdots \\ y_{kT} \end{bmatrix} = Z'b_k + u_{(k)} = Z'R_kc_k + u_{(k)},$$

where $b_k = R_kc_k$ reflects the zero restrictions imposed on the parameters $b_k$
of the $k$-th equation. $R_k$ is the restriction matrix. The LS estimator of $c_k$ is

$$\hat c_k = (R_k'ZZ'R_k)^{-1}R_k'Zy_{(k)}$$

and the implied restricted LS estimator for $b_k$ is

$$\hat b_k = R_k\hat c_k.$$

Furthermore, a corresponding estimator of the residual variance is

$$\tilde\sigma^2(R_k) = (y_{(k)} - Z'\hat b_k)'(y_{(k)} - Z'\hat b_k)/T.$$

Thus, the AIC value for a model with these restrictions is

$$\mathrm{AIC}(R_k) = \ln\tilde\sigma^2(R_k) + \frac{2}{T}\,\mathrm{rk}(R_k).$$

The other criteria are determined in a similar way.

In the foregoing subset procedure based on AIC, the unrestricted model
with $R_k = I_{Kp+1}$ is estimated first and the corresponding value $\mathrm{AIC}(I_{Kp+1})$
is determined. Then the last column of $I_{Kp+1}$ is eliminated. Let us denote the
resulting restriction matrix by $R_k^{(1)}$. If

$$\mathrm{AIC}(R_k^{(1)}) \le \mathrm{AIC}(I_{Kp+1}),$$

the next restriction matrix $R_k^{(2)}$, say, is obtained by deleting the last column
of $R_k^{(1)}$ and $\mathrm{AIC}(R_k^{(2)})$ is compared to $\mathrm{AIC}(R_k^{(1)})$. If, however,

$$\mathrm{AIC}(R_k^{(1)}) > \mathrm{AIC}(I_{Kp+1}),$$

the restriction matrix $R_k^{(2)}$ is obtained by deleting the second last column of
$I_{Kp+1}$ and the next restriction matrix is decided upon by comparing $\mathrm{AIC}(R_k^{(2)})$
to $\mathrm{AIC}(I_{Kp+1})$. In each step, a column of the restriction matrix is deleted if
that leads to a reduction or at least not to an increase of the AIC criterion.
Otherwise the column is retained.
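
The column-deletion scheme just described can be sketched for a single equation as follows. The code is an illustration under stated assumptions: the function names, the simulated data, and the use of a $(T \times N)$ regressor matrix `X` (playing the role of $Z'$, with the intercept in the first column) are choices made here, and `c_T = 2` gives the AIC penalty.

```python
import numpy as np

def criterion(y, X, c_T=2.0):
    """ln(SSE/T) + c_T * (number of regressors) / T for an LS fit of y on X."""
    T = len(y)
    if X.shape[1] == 0:
        resid = y
    else:
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
    return np.log(resid @ resid / T) + c_T * X.shape[1] / T

def top_down(y, X, c_T=2.0):
    """Scan the columns of X from last to first and drop a column whenever the
    criterion does not increase relative to the smallest value found so far."""
    keep = list(range(X.shape[1]))
    best = criterion(y, X, c_T)
    for j in reversed(range(X.shape[1])):
        trial = [i for i in keep if i != j]
        value = criterion(y, X[:, trial], c_T)
        if value <= best:
            keep, best = trial, value
    return keep, best

# Hypothetical data: only the intercept and the first regressor matter.
rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal((T, 4))])
y = 1.0 + 2.0 * X[:, 1] + rng.standard_normal(T)

keep, best = top_down(y, X)       # indices of retained columns, final criterion
full = criterion(y, X)            # criterion of the unrestricted equation
```

The retained column indices correspond to the columns of the final restriction matrix $R_k$; passing `c_T = np.log(T)` or `c_T = 2 * np.log(np.log(T))` switches to SC or HQ.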

The procedure is repeated for each of the $K$ equations of the $K$-dimensional
system, that is, a restriction matrix, $R_k$ say, is determined for each equation
separately. Once all zero restrictions have been determined by this strategy,
the $K$ equations of the restricted model with overall restriction matrix

$$\bar R = \begin{bmatrix} R_1 & & 0 \\ & \ddots & \\ 0 & & R_K \end{bmatrix}$$

can be estimated simultaneously by EGLS, as described in Sections 5.2.2
and 5.2.4. Note that SC tends to choose the most parsimonious models with
the fewest coefficients whereas AIC has a tendency to select the most lavish
models.

The advantage of this top-down procedure, starting from the top (largest
model) and then working down gradually, is that it permits checking all
individual coefficients. Also, the computational expense is very reasonable. The
disadvantage of the method is that it requires estimation of each full equation
in the initial selection step. This may exhaust the available degrees of freedom
if a model with large order is deemed necessary for some high-dimensional
system. Therefore, a slightly more elaborate bottom-up strategy may be preferred
occasionally.


Bottom-Up  Strategy 

Again the restrictions are chosen for each equation separately. In the $k$-th
equation, only lags of the first variable are considered initially and an optimal
lag length $p_1$, say, for that variable is selected. That is, we select the optimal
model of the form

$$y_{kt} = \nu_k + \alpha_{k1,1}y_{1,t-1} + \cdots + \alpha_{k1,p_1}y_{1,t-p_1} + u_{kt}$$

by fitting models

$$y_{kt} = \nu_k + \alpha_{k1,1}y_{1,t-1} + \cdots + \alpha_{k1,n}y_{1,t-n} + u_{kt},$$

where $n$ ranges from zero to some prespecified upper bound $p$ for the order.
$p_1$ is that order for which the selection criterion, e.g., AIC, HQ, or SC, is
minimized.

In the next step, $p_1$ is held fixed and lags of $y_2$ are added into the equation.
Denoting the optimal lag length for $y_2$ by $p_2$ gives

$$y_{kt} = \nu_k + \alpha_{k1,1}y_{1,t-1} + \cdots + \alpha_{k1,p_1}y_{1,t-p_1} + \alpha_{k2,1}y_{2,t-1} + \cdots
+ \alpha_{k2,p_2}y_{2,t-p_2} + u_{kt}.$$

Note that $p_2$ may, of course, be zero in which case $y_2$ does not enter the
equation.

In the third step, $p_1$ and $p_2$ are both held fixed and the third variable, $y_3$,
is absorbed into the equation in the same way. This procedure is continued
until an optimal lag length for each of the $K$ variables is obtained, conditional
on the "optimal" lags of the previous variables.

Due to omitted variables effects, some of the lag lengths may be
overstated in the final equation. For instance, when none of the other variables
enters the equation, lags of $y_1$ may be useful in explaining $y_k$ and in
reducing the selection criterion. In contrast, lags of $y_1$ may not contribute to
explaining $y_k$ when lags of all the other variables are present too. Therefore,
once $p_1, \dots, p_K$ are chosen, a top-down run, as described in the previous
subsection, may complete the search for zero restrictions for the $k$-th equation.
After zero constraints have been obtained for each equation in this fashion,
the $K$ restricted equations may be estimated as one system, using EGLS or
ML procedures.

Obviously, it is possible in this bottom-up approach that the largest model,
where all $K$ variables enter with $p$ lags in each equation, is never fitted. Thereby
considerable savings of degrees of freedom may be possible, especially if the
maximum order $p$ is substantial. A drawback of the procedure is that the final
set of restrictions may depend on the order of the variables.
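
The lag-length search of the bottom-up strategy can be sketched as follows, without the optional top-down run at the end. The function names and the simulated bivariate process are hypothetical; the sketch assumes that all candidate models are fitted on the same sample (the first `p_max` observations are set aside as presample values) and that the intercept is always retained.

```python
import numpy as np

def criterion(y, X, c_T=2.0):
    """ln(SSE/T) + c_T * (number of regressors) / T for an LS fit of y on X."""
    T = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return np.log(resid @ resid / T) + c_T * X.shape[1] / T

def bottom_up(data, k, p_max, c_T=2.0):
    """Choose lag lengths p_1, ..., p_K for equation k, one variable at a time."""
    n_obs, K = data.shape
    y = data[p_max:, k]

    def lags(j, n):
        # Lags 1, ..., n of variable j, aligned with y.
        return [data[p_max - i:n_obs - i, j] for i in range(1, n + 1)]

    columns = [np.ones(n_obs - p_max)]     # intercept
    chosen = []
    for j in range(K):
        values = [criterion(y, np.column_stack(columns + lags(j, n)), c_T)
                  for n in range(p_max + 1)]
        p_j = int(np.argmin(values))       # may be zero: variable j is left out
        chosen.append(p_j)
        columns += lags(j, p_j)            # held fixed in all later steps
    return chosen

# Hypothetical bivariate process in which y_1 depends strongly on its own lag.
rng = np.random.default_rng(2)
n_obs = 300
data = np.zeros((n_obs, 2))
for t in range(1, n_obs):
    data[t, 0] = 0.8 * data[t - 1, 0] + rng.standard_normal()
    data[t, 1] = 0.5 * data[t - 1, 0] + rng.standard_normal()

chosen = bottom_up(data, k=0, p_max=4)     # [p_1, p_2] for the first equation
```

Reordering the columns of `data` before the call is a simple way to see the dependence on the variable ordering mentioned above.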

Sequential  Elimination  of  Regressors 

Individual zero coefficients can also be chosen on the basis of the $t$-ratios of
the parameter estimators. A possible strategy is to sequentially delete those
regressors with the smallest absolute values of $t$-ratios until all $t$-ratios (in
absolute value) are greater than some threshold value, say $\tau$. In this procedure
one regressor is eliminated at a time. Then new $t$-ratios are computed for the
reduced model. Brüggemann & Lütkepohl (2001) showed that this strategy
is equivalent to the sequential elimination based on model selection criteria if
the threshold value $\tau$ is chosen accordingly. More precisely, they considered a
regression equation

$$y_{kt} = \beta_1x_{1t} + \cdots + \beta_Nx_{Nt} + u_{kt},$$

where all regressors are denoted by $x_{jt}$, that is, $x_{jt}$ may represent an
intercept term or lags of the variables involved in our analysis. Brüggemann &
Lütkepohl (2001) studied a procedure in which regressors are deleted
sequentially, one at a time, each time removing the regressor whose deletion
leads to the largest reduction of the given selection criterion, until no further
reduction is possible. For a model selection criterion of the type

$$\mathrm{CR}(i_1, \dots, i_n) = \ln(\mathrm{SSE}(i_1, \dots, i_n)/T) + c_Tn/T,$$

where $\mathrm{SSE}(i_1, \dots, i_n)$ is the sum of squared errors obtained by
including $x_{i_1t}, \dots, x_{i_nt}$ in the regression model and $c_T$ is a sequence indexed by
the sample size, Brüggemann & Lütkepohl (2001) showed that choosing
$\tau_j = \{[\exp(c_T/T) - 1](T - N + j - 1)\}^{1/2}$ in the $j$-th step of the elimination
procedure based on $t$-ratios results in the same final model that is also obtained
by sequentially minimizing the selection criterion defined by the penalty term
$c_T$. Hence, the threshold value depends on the selection criterion via $c_T$, the
sample size, and the number of regressors in the model. The threshold values
for the $t$-ratios correspond to the critical values of the tests. For the criteria
AIC, HQ, and SC, the $c_T$ sequences are $c_T(\mathrm{AIC}) = 2$, $c_T(\mathrm{HQ}) = 2\ln\ln T$, and
$c_T(\mathrm{SC}) = \ln T$, respectively. Using these criteria in the procedure, for an
equation with 20 regressors and a sample size of $T = 100$, roughly corresponds to
eliminating all regressors with $t$-values that are not significant at the 15-20%,
10% or 2-3% levels, respectively (see Brüggemann & Lütkepohl (2001)).
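
The threshold formula can be evaluated for the setting mentioned above ($N = 20$ regressors, $T = 100$). The helper name `t_threshold` is hypothetical, and the sketch only computes the thresholds for the first and last elimination steps; it does not compute significance levels.

```python
from math import exp, log, sqrt

def t_threshold(c_T, T, N, j):
    """Threshold for |t| in the j-th elimination step, given the penalty c_T."""
    return sqrt((exp(c_T / T) - 1.0) * (T - N + j - 1))

T, N = 100, 20
penalties = {"AIC": 2.0, "HQ": 2.0 * log(log(T)), "SC": log(T)}

# Thresholds in the first step (N regressors) and in the N-th step (one left).
first_step = {name: t_threshold(c, T, N, 1) for name, c in penalties.items()}
last_step = {name: t_threshold(c, T, N, N) for name, c in penalties.items()}
```

The thresholds increase from AIC to HQ to SC and grow slowly as regressors are removed, which is consistent with SC eliminating regressors most readily.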
Procedures similar to those discussed here were, for instance, applied
by Hsiao (1979, 1982) and Lütkepohl (1987). Other subset VAR strategies
were proposed by Penm & Terrell (1984, 1986), Penm, Brailsford & Terrell
(2000), and Brüggemann (2004). Moreover, more elaborate,
computer-automated model specification and subset selection strategies based on a
mixture of testing and model selection criteria were recently implemented in the
software package PcGets (see Hendry & Krolzig (2001)). The alternative
subset modelling procedures all have their advantages and drawbacks. Therefore,
at this stage, none of them can be recommended as a universally best choice
in practice.


5.2.9  Model  Checking 

After a subset VAR model has been fitted, some checks of the model adequacy
are in order. Of course, one check is incorporated in the model selection
procedure if some criterion is optimized. By definition, the best model is the one
that leads to the optimum criterion value. In practice, the choice of the
criterion is often ad hoc or even arbitrary and, in fact, several competing criteria
are often employed. It is then left to the applied researcher to decide on the
final model to be used for forecasting or economic analysis. In some cases,
statistical tests of restrictions may aid in that decision. For example, $F$-tests, as
described in Section 4.2.2, may be helpful for that purpose. In the following,
we will discuss tests for residual autocorrelation.


Residual  Autocovariances  and  Autocorrelations 

The autocorrelation tests considered in Chapter 4 can also be used to check
the white noise assumption for the $u_t$ process in a subset VAR model, if
suitable adjustments are made. For that purpose, we will first consider the
residual autocovariances and autocorrelations. In analogy with Section 4.4 of
Chapter 4, we use the following notation:

$$C_i := \frac{1}{T}\sum_{t=i+1}^{T}u_tu_{t-i}', \quad i = 0, 1, \dots, h,$$

$$\mathbf{C}_h := (C_1, \dots, C_h),$$

$$c_h := \mathrm{vec}(\mathbf{C}_h),$$

$\hat u_t$ is the $t$-th estimation residual of a restricted estimation,

$$\hat C_i := \frac{1}{T}\sum_{t=i+1}^{T}\hat u_t\hat u_{t-i}', \quad i = 0, 1, \dots, h,$$

$$\hat{\mathbf{C}}_h := (\hat C_1, \dots, \hat C_h),$$

$$\hat c_h := \mathrm{vec}(\hat{\mathbf{C}}_h),$$

$\hat D$ is the diagonal matrix with the square roots of the diagonal elements of $\hat C_0$
on the diagonal,

$$\hat R_i := \hat D^{-1}\hat C_i\hat D^{-1}, \quad i = 0, 1, \dots, h,$$

$$\hat{\mathbf{R}}_h := (\hat R_1, \dots, \hat R_h),$$

$$\hat r_h := \mathrm{vec}(\hat{\mathbf{R}}_h).$$

In the following proposition, the asymptotic distributions of $\hat c_h$ and $\hat r_h$ are
given under the assumption of a correctly specified model.

Proposition 5.7 (Asymptotic Distributions of Residual Autocovariances and
Autocorrelations)

Suppose $y_t$ is a stable, stationary, $K$-dimensional VAR($p$) process with
identically distributed standard white noise $u_t$ and the parameter vector $\beta$ satisfies
the restrictions $\beta = R\gamma + r$ with $R$ being a known $(K(Kp+1) \times M)$
matrix of rank $M$. Furthermore suppose that $\beta$ is estimated by EGLS such that
$\hat{\hat\beta} = R\hat{\hat\gamma} + r$. Then

$$\sqrt{T}\hat c_h \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat c}(h)), \tag{5.2.32}$$

where

$$\Sigma_{\hat c}(h) = I_h \otimes \Sigma_u \otimes \Sigma_u - GR[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'G'$$

and $G := \tilde G' \otimes I_K$ is the matrix defined in Chapter 4, Lemma 4.2. Furthermore,

$$\sqrt{T}\hat r_h \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat r}(h)), \tag{5.2.33}$$

where

$$\Sigma_{\hat r}(h) = I_h \otimes R_u \otimes R_u - (G_0' \otimes D^{-1})R[R'(\Gamma \otimes \Sigma_u^{-1})R]^{-1}R'(G_0 \otimes D^{-1})$$

and $G_0 := \tilde G(I_h \otimes D^{-1})$ is defined in Proposition 4.6, $D$ is the diagonal matrix
with the square roots of the diagonal elements of $\Sigma_u$ on the diagonal, and
$R_u := D^{-1}\Sigma_uD^{-1}$ is the correlation matrix corresponding to $\Sigma_u$. ■

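Proof: The proof is similar to that of Propositions 4.5 and 4.6. Defining $G$
as in Lemma 4.2, the lemma implies that $\sqrt{T}\hat c_h$ is known to have the same
asymptotic distribution as

$$
\begin{aligned}
\sqrt{T}c_h - \sqrt{T}G\,\mathrm{vec}(\hat{\hat B} - B)
&= [-\tilde G' \otimes I_K : I]\begin{bmatrix} \sqrt{T}\,\mathrm{vec}(\hat{\hat B} - B) \\ \sqrt{T}c_h \end{bmatrix} \\
&= [-(\tilde G' \otimes I_K)R : I]\begin{bmatrix} \sqrt{T}(\hat{\hat\gamma} - \gamma) \\ \sqrt{T}c_h \end{bmatrix} \\
&= [-(\tilde G' \otimes I_K)R : I]
\begin{bmatrix}
\left[R'\left(\dfrac{ZZ'}{T} \otimes \bar\Sigma_u^{-1}\right)R\right]^{-1}R'(I_{Kp+1} \otimes \bar\Sigma_u^{-1}) & 0 \\
0 & I
\end{bmatrix}
\begin{bmatrix} \dfrac{1}{\sqrt{T}}\mathrm{vec}(UZ') \\ \sqrt{T}c_h \end{bmatrix}
\end{aligned}
$$

(see (5.2.7)). The asymptotic distribution in (5.2.32) then follows from Lemma
4.3 and Proposition C.2 by noting that $\Gamma = \mathrm{plim}(ZZ'/T)$ and $\Sigma_u = \mathrm{plim}\,\bar\Sigma_u$.
The limiting distribution of $\sqrt{T}\hat r_h$ follows as in the proof of Proposition 4.6. ■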

The results in Proposition 5.7 can be used to check the white noise
assumption for the $u_t$'s. As in Section 4.4, residual autocorrelations are often
plotted and evaluated on the basis of two-standard error bounds about zero.
Estimators of the standard errors are obtained by replacing all unknown
quantities in $\Sigma_{\hat r}(h)$ by consistent estimators. Specifically $\Sigma_u$ may be estimated by
$\hat C_0$. We will illustrate the resulting white noise test in Section 5.2.10 with an
example.


Portmanteau  Tests 


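For the portmanteau statistic

$$Q_h := T \sum_{i=1}^{h} \mathrm{tr}(\hat{C}_i' \hat{C}_0^{-1} \hat{C}_i \hat{C}_0^{-1})
= T\,\hat{c}_h' (I_h \otimes \hat{C}_0^{-1} \otimes \hat{C}_0^{-1})\,\hat{c}_h \tag{5.2.34}$$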

we  get  the  following  result. 

Proposition 5.8 (Approximate Distribution of the Portmanteau Statistic)
Suppose the conditions of Proposition 5.7 are satisfied and there are no restrictions
linking the intercept terms to the $A_1, \dots, A_p$ coefficients, that is,

$$R = \begin{bmatrix} R_{(1)} & 0 \\ 0 & R_{(2)} \end{bmatrix}$$

is block-diagonal with $R_{(1)}$ and $R_{(2)}$ having row-dimensions $K$ and $K^2 p$,
respectively. Then $Q_h$ has an approximate limiting $\chi^2$-distribution with
$K^2 h - \mathrm{rk}(R_{(2)})$ degrees of freedom for large $T$ and $h$. ■

Proof: Under the conditions of the proposition, the covariance matrix of the
asymptotic distribution in (5.2.32) is

$$\Sigma_c^r(h) = I_h \otimes \Sigma_u \otimes \Sigma_u - \bar{G} R_{(2)} \{ R_{(2)}' [\Gamma_Y(0) \otimes \Sigma_u^{-1}] R_{(2)} \}^{-1} R_{(2)}' \bar{G}',$$

where $\bar{G}$ is the matrix defined in Lemma 4.2. Using this fact, Proposition 5.8
can be proven just as Proposition 4.7 by replacing $\bar{G}$ in that proof by $\bar{G} R_{(2)}$
(see Section 4.4.3). ■

The degrees of freedom in Proposition 5.8 are obtained by subtracting the
number of unconstrained VAR coefficients, $\mathrm{rk}(R_{(2)})$, from $K^2 h$. As in
Section 4.4.3, the modified portmanteau statistic

$$\bar{Q}_h := T^2 \sum_{i=1}^{h} (T - i)^{-1}\,\mathrm{tr}(\hat{C}_i' \hat{C}_0^{-1} \hat{C}_i \hat{C}_0^{-1}) \tag{5.2.35}$$

may be preferable for testing the white noise assumption in small samples. In
other words, under the white noise hypothesis, the small sample distribution
of $\bar{Q}_h$ may be closer to the approximate $\chi^2$-distribution than that of $Q_h$.
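The two statistics in (5.2.34) and (5.2.35) can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `portmanteau` and the layout of the residual array (one row per time period) are hypothetical choices, and the residual autocovariances are computed as $\hat{C}_i = T^{-1}\sum_{t=i+1}^{T}\hat{u}_t\hat{u}_{t-i}'$ without mean adjustment.

```python
import numpy as np

def portmanteau(resid, h):
    """Return (Q_h, modified Q_h) for a (T x K) array of residuals.

    Hypothetical helper: C_i = (1/T) * sum_{t=i+1}^{T} u_t u_{t-i}'.
    """
    U = np.asarray(resid, dtype=float)
    T, K = U.shape
    C0_inv = np.linalg.inv(U.T @ U / T)
    q = 0.0
    q_mod = 0.0
    for i in range(1, h + 1):
        Ci = U[i:].T @ U[:-i] / T            # lag-i residual autocovariance
        term = np.trace(Ci.T @ C0_inv @ Ci @ C0_inv)
        q += term                            # summand of (5.2.34)
        q_mod += term / (T - i)              # summand of (5.2.35)
    return T * q, T**2 * q_mod
```

For a restricted VAR, the resulting value would be compared with a $\chi^2$ critical value with $K^2 h - \mathrm{rk}(R_{(2)})$ degrees of freedom, as stated in Proposition 5.8; that comparison is left out of the sketch.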

LM  Test  for  Residual  Autocorrelation 

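As for unrestricted models, an LM test for residual autocorrelation can also
be constructed if parameter restrictions are imposed on a VAR model. For
simplicity of exposition, we assume now that the restrictions can be written
in the form $\beta = \mathrm{vec}(B) = R\gamma$. For example, there may be zero restrictions. In
that case, a possible test statistic may be obtained by considering the auxiliary
model

$$\hat{\Sigma}_u^{-1/2} \hat{U} = \hat{\Sigma}_u^{-1/2} B Z + \hat{\Sigma}_u^{-1/2} D \hat{\mathcal{U}} + \mathcal{E}, \tag{5.2.36}$$

where $D = [D_1 : \cdots : D_h]$ is $(K \times Kh)$, $\hat{\mathcal{U}} = (I_h \otimes \hat{U})F'$ with $F$ as in (4.4.2),
$\mathcal{E} = [\varepsilon_1, \dots, \varepsilon_T]$ is a $(K \times T)$ error matrix, $\hat{\Sigma}_u$ is some consistent estimator of
$\Sigma_u$ which has been used in EGLS estimation, $\hat{\Sigma}_u^{-1/2}$ is a symmetric matrix
such that $\hat{\Sigma}_u^{-1/2} \hat{\Sigma}_u^{-1/2} = \hat{\Sigma}_u^{-1}$, and the other symbols are defined as before.
Note, however, that the $\hat{u}_t$ are now the residuals from EGLS estimation of the
original restricted VAR($p$) model and $\hat{u}_t = 0$ for $t \le 0$. The vectorized version
of the auxiliary model (5.2.36) is

$$(I_T \otimes \hat{\Sigma}_u^{-1/2})\,\mathrm{vec}(\hat{U}) = (Z' \otimes \hat{\Sigma}_u^{-1/2}) R \gamma + (\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1/2})\,\mathrm{vec}(D) + \mathrm{vec}(\mathcal{E}).$$

Defining $\delta = \mathrm{vec}(D)$, the (EG)LS estimator from this auxiliary model is

$$
\begin{bmatrix} \hat{\gamma} \\ \hat{\delta} \end{bmatrix}
=
\begin{bmatrix}
R'(ZZ' \otimes \hat{\Sigma}_u^{-1})R & R'(Z\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}) \\
(\hat{\mathcal{U}}Z' \otimes \hat{\Sigma}_u^{-1})R & \hat{\mathcal{U}}\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}
\end{bmatrix}^{-1}
\begin{bmatrix}
R'(Z \otimes \hat{\Sigma}_u^{-1}) \\
\hat{\mathcal{U}} \otimes \hat{\Sigma}_u^{-1}
\end{bmatrix}
\mathrm{vec}(\hat{U}).
$$

The first order conditions for a minimum of the EGLS objective function for
the original restricted VAR model are

$$
\left. \frac{\partial\, [\mathbf{y} - (Z' \otimes I_K)R\gamma]' (I_T \otimes \hat{\Sigma}_u^{-1}) [\mathbf{y} - (Z' \otimes I_K)R\gamma]}{\partial \gamma} \right|_{\hat{\gamma}}
= -2 R'(Z \otimes I_K)(I_T \otimes \hat{\Sigma}_u^{-1}) [\mathbf{y} - (Z' \otimes I_K)R\hat{\gamma}] = 0.
$$

Hence, $R'(Z \otimes \hat{\Sigma}_u^{-1})\,\mathrm{vec}(\hat{U}) = 0$. Applying the rules for the partitioned inverse
(see Appendix A.10) thus gives

$$
\hat{\delta} = \left( \hat{\mathcal{U}}\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}
- (\hat{\mathcal{U}}Z' \otimes \hat{\Sigma}_u^{-1}) R [R'(ZZ' \otimes \hat{\Sigma}_u^{-1})R]^{-1} R'(Z\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}) \right)^{-1}
\mathrm{vec}(\hat{\Sigma}_u^{-1} \hat{U} \hat{\mathcal{U}}').
$$

The usual $\chi^2$-statistic for testing $\delta = 0$ is

$$
\lambda_{LM}(h) = \hat{\delta}' \left( \hat{\mathcal{U}}\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}
- (\hat{\mathcal{U}}Z' \otimes \hat{\Sigma}_u^{-1}) R [R'(ZZ' \otimes \hat{\Sigma}_u^{-1})R]^{-1} R'(Z\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}) \right) \hat{\delta}.
$$

Substituting the expression for $\hat{\delta}$ and using $\mathrm{vec}(\hat{U}\hat{\mathcal{U}}') = T\hat{c}_h$, it can be seen that

$$\lambda_{LM}(h) = T\,\hat{c}_h'\,\hat{\Sigma}_c^r(h)^{-1}\,\hat{c}_h,$$

where

$$
\hat{\Sigma}_c^r(h) = \frac{1}{T} (I_{Kh} \otimes \hat{\Sigma}_u) \left( \hat{\mathcal{U}}\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}
- (\hat{\mathcal{U}}Z' \otimes \hat{\Sigma}_u^{-1}) R [R'(ZZ' \otimes \hat{\Sigma}_u^{-1})R]^{-1} R'(Z\hat{\mathcal{U}}' \otimes \hat{\Sigma}_u^{-1}) \right) (I_{Kh} \otimes \hat{\Sigma}_u)
$$

is a consistent estimator of $\Sigma_c^r(h)$. Thus, the situation is completely analogous
to the case of an unrestricted model treated in Section 4.4.4 and we get the
following result from Propositions 5.7 and C.15(5).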

Proposition 5.9 (Asymptotic Distribution of LM Statistic for Residual Autocorrelation
of Restricted VAR)

Under the conditions of Proposition 5.7,

$$\lambda_{LM}(h) \xrightarrow{d} \chi^2(hK^2).$$ ■


Notice that unlike for the portmanteau test, the asymptotic distribution of
the LM statistic is identical to that obtained for unrestricted VARs in Proposition
4.8. However, $\lambda_{LM}(h)$ is in general not exactly an LM statistic because
the restricted estimator $\hat{\gamma}$ is not identical to the ML estimator. Clearly, this
does not affect the asymptotic properties of the test statistic.
Other Checks of Restricted Models

It  must  also  be  kept  in  mind  that  our  discussion  has  been  based  on  a  number 
of  further  assumptions  that  should  be  checked.  Prominent  among  them  are 
stationarity,  stability,  and  normality.  The  latter  is  used  in  setting  up  forecast 
intervals  and  regions  and  the  former  properties  are  basic  conditions  underlying 
much of our analysis (see, for instance, Propositions 5.1-5.6). The stability
tests  based  on  predictions  and  described  in  Section  4.6  of  Chapter  4  may  be 
applied  in  the  same  way  as  for  full  unrestricted  VAR  processes.  Of  course,  now 
the  forecasts  and  MSE  matrix  estimators  should  be  based  on  the  restricted 
coefficient  estimators  as  discussed  in  Section  5.2.6.  Also,  it  is  easy  to  see  from 
Section  4.5  that  the  tests  for  nonnormality  remain  valid  when  true  restrictions 
are  placed  on  the  VAR  coefficient  matrices. 


5.2.10  An  Example 

As an example, we use again the same data as in Section 3.2.3 and some other
previous sections. That is, $y_{1t}$, $y_{2t}$, and $y_{3t}$ are first differences of logarithms of
investment, income, and consumption, respectively. We keep four presample
values  and  use  sample  values  from  the  period  1961.2-1978.4.  Hence,  the  time 
series  length  is  T  =  71.  We  have  applied  the  top-down  strategy  with  selection 
criteria  AIC,  HQ,  and  SC  and  a  VAR  order  of  p  =  4.  In  other  words,  we  use 
the  same  maximum  order  as  in  the  order  selection  procedure  for  full  VAR 
models  in  Chapter  4.  Because  HQ  and  SC  choose  the  same  model,  we  get 
two  different  models  only  which  are  shown  in  Table  5.1.  As  usual,  the  HQ-SC 
model  is  more  parsimonious  than  the  AIC  model. 

In Table 5.1, modified portmanteau statistics with corresponding p-values
are also given for both models. Obviously, none of the test values gives rise
to concern about the models. In Figure 5.1, residual autocorrelations of the
HQ-SC model with estimated two-standard error bounds about zero are depicted.
The rather unusual looking estimated two-standard error bounds for
some low lags are a consequence of the zero elements in the estimated VAR
coefficient matrices. Recall that the asymptotic standard errors are bounded
from above by $1/\sqrt{T}$. For low lags, they can be substantially smaller, however,
and this property is clearly reflected in Figure 5.1. Although some individual
autocorrelations fall outside the two-standard error bounds about zero, this
is not necessarily a reason for modifying the model. As in Chapter 4, such a
decision depends on which criterion is given priority.

We  have  also  produced  forecasts  with  the  HQ-SC  model  and  give  them  in 
Table  5.2  together  with  forecasts  from  a  full  VAR(4)  model.  In  this  example, 
the  forecasts  from  the  two  models  are  quite  close  and  the  estimated  forecast 
intervals  from  the  subset  model  are  all  smaller  than  those  of  a  full  VAR(4) 
model.  Although  theoretically  the  more  parsimonious  subset  model  produces 
more  precise  forecasts  if  the  restrictions  are  correct,  it  must  be  kept  in  mind 
that  in  the  present  case  the  restrictions,  the  forecasts  and  forecast  intervals 
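Table 5.1. EGLS estimates of subset VAR models for the investment/income/consumption data

Model selected by AIC:

$$
\hat{\nu} = \begin{bmatrix} .015^{*}\ (.006) \\ .015\ (.003) \\ .013\ (.003) \end{bmatrix},\quad
\hat{A}_1 = \begin{bmatrix} -.219\ (.104) & 0 & 0 \\ 0 & 0 & .235\ (.133) \\ 0 & .274\ (.082) & -.391\ (.116) \end{bmatrix},\quad
\hat{A}_2 = \begin{bmatrix} 0 & 0 & 0 \\ .010\ (.024) & 0 & 0 \\ 0 & .335\ (.073) & 0 \end{bmatrix},
$$

$$
\hat{A}_3 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & .095\ (.076) & 0 \end{bmatrix},\quad
\hat{A}_4 = \begin{bmatrix} .340\ (.103) & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
$$

$\bar{Q}_{12} = 79.3$ [.937]**
$\bar{Q}_{20} = 144$ [.943]

Model selected by HQ and SC:

$$
\hat{\nu} = \begin{bmatrix} .015\ (.006) \\ .020\ (.001) \\ .016\ (.003) \end{bmatrix},\quad
\hat{A}_1 = \begin{bmatrix} -.225\ (.104) & 0 & 0 \\ 0 & 0 & 0 \\ 0 & .261\ (.081) & -.439\ (.095) \end{bmatrix},\quad
\hat{A}_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & .329\ (.074) & 0 \end{bmatrix},
$$

$$
\hat{A}_3 = 0,\quad
\hat{A}_4 = \begin{bmatrix} .331\ (.103) & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
$$

$\bar{Q}_{12} = 85.5$ [.893]
$\bar{Q}_{20} = 152$ [.898]

*Estimated standard errors in parentheses.
**p-value.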
Fig. 5.1. Estimated residual autocorrelations of the investment/income/consumption
HQ-SC subset VAR model with estimated asymptotic two-standard error
bounds.


are estimated on the basis of a single realization of an unknown data generation
process. Under these circumstances, a subset model may produce less
precise  forecasts  than  a  heavily  parameterized  full  VAR  model.  Note  that  in 
the  present  subset  model,  the  income  forecasts  are  the  same  for  all  forecast 
horizons  because  income  is  generated  by  a  white  noise  process  in  the  HQ-SC 
model. 

We have also computed impulse responses from the HQ-SC subset VAR
model. The $\Theta_i$ responses of consumption to an impulse in income based on
orthogonalized residuals are depicted in Figure 5.2. Comparing them with
Figure 3.8 shows that they are qualitatively similar to the impulse responses
from the full VAR(2) model. Considering the responses of investment to a consumption
innovation reveals that they are all zero in the subset VAR model.
A  closer  look  at  Table  5.1  shows  that  income/consumption  are  not  Granger- 
causal  for  investment  in  both  subset  models.  This  result  was  also  obtained  in 
the  full  VAR  model  (see  Section  3.6.2).  However,  now  it  is  directly  seen  in  the 
model  without  further  causality  testing.  In  other  words,  the  causality  testing 
is  built  into  the  model  selection  procedure. 
Table 5.2. Point and interval forecasts from full and subset VAR(4) models for the
investment/income/consumption example

                              full VAR(4)                 HQ-SC subset VAR(4)
              forecast    point     95% interval        point     95% interval
variable      horizon     forecast  forecast            forecast  forecast
investment    1           .006      [-.091, .103]       .015      [-.074, .105]
              2           .025      [-.075, .125]       .023      [-.068, .115]
              3           .028      [-.071, .126]       .018      [-.073, .110]
              4           .026      [-.074, .125]       .023      [-.069, .115]
income        1           .021      [-.005, .047]       .020      [-.004, .044]
              2           .022      [-.004, .049]       .020      [-.004, .044]
              3           .017      [-.009, .043]       .020      [-.004, .044]
              4           .022      [-.004, .049]       .020      [-.004, .044]
consumption   1           .022      [ .001, .042]       .023      [ .004, .042]
              2           .015      [-.006, .036]       .013      [-.007, .033]
              3           .020      [-.004, .043]       .022      [ .001, .044]
              4           .019      [-.004, .042]       .018      [-.004, .040]


Fig. 5.2. Estimated responses of consumption to an orthogonalized impulse in
income with two-standard error bounds based on the HQ-SC subset VAR model.
5.3 VAR Processes with Nonlinear Parameter
Restrictions

Some authors have suggested nonlinear constraints for the coefficients of a
VAR model. For instance, multiplicative models with VAR operator

$$
\begin{aligned}
A(L) &= I_K - A_1 L - \cdots - A_p L^p \\
&= (I_K - B_1 L^s - \cdots - B_Q L^{sQ})(I_K - C_1 L - \cdots - C_q L^q) \\
&= B(L^s)\,C(L)
\end{aligned}
$$

have been considered. Here $L$ is the lag operator defined in Chapter 2, Section
2.1.2, the $B_i$'s and $C_j$'s are $(K \times K)$ coefficient matrices and $B(L^s)$ contains
"seasonal" powers of $L$ only. Such models may be useful for seasonal data.
For instance, for quarterly data, a multiplicative seasonal operator may have
the form

$$(I_K - B_1 L^4)(I_K - C_1 L - C_2 L^2).$$

The corresponding VAR operator is

$$
\begin{aligned}
A(L) &= I_K - A_1 L - \cdots - A_6 L^6 \\
&= I_K - C_1 L - C_2 L^2 - B_1 L^4 + B_1 C_1 L^5 + B_1 C_2 L^6
\end{aligned}
$$

so that $A_1 = C_1$, $A_2 = C_2$, $A_3 = 0$, $A_4 = B_1$, $A_5 = -B_1 C_1$, $A_6 = -B_1 C_2$.
Hence, the coefficients $\alpha := \mathrm{vec}[A_1, \dots, A_p]$ are determined by
$\gamma := \mathrm{vec}[B_1, C_1, C_2]$, that is,

$$\alpha = g(\gamma). \tag{5.3.1}$$
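The mapping $\alpha = g(\gamma)$ for the quarterly example can be checked numerically by multiplying the two matrix lag polynomials. The following is a minimal sketch; the helper names `multiply_lag_polys` and `seasonal_var_coeffs` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def multiply_lag_polys(B, C):
    """Multiply two matrix lag polynomials.

    B and C are lists of (K x K) arrays; list index = power of L.
    """
    K = B[0].shape[0]
    out = [np.zeros((K, K)) for _ in range(len(B) + len(C) - 1)]
    for i, Bi in enumerate(B):
        for j, Cj in enumerate(C):
            out[i + j] += Bi @ Cj
    return out

def seasonal_var_coeffs(B1, C1, C2):
    """Return [A_1, ..., A_6] implied by (I - B1 L^4)(I - C1 L - C2 L^2)."""
    K = B1.shape[0]
    I, Z = np.eye(K), np.zeros((K, K))
    B_poly = [I, Z, Z, Z, -B1]        # I_K - B_1 L^4
    C_poly = [I, -C1, -C2]            # I_K - C_1 L - C_2 L^2
    prod = multiply_lag_polys(B_poly, C_poly)
    # A(L) = I_K - A_1 L - ... - A_6 L^6, so A_i is minus the L^i coefficient
    return [-P for P in prod[1:]]
```

Stacking the returned matrices with `np.hstack(...).flatten(order="F")` would give $\alpha$ as a function of the free parameters in $B_1$, $C_1$, and $C_2$.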

There are also other types of nonlinear constraints that may be written in this
way. For example, the VAR operator may have the form $A(L) = B(L)C(L)$,
where

$$C(L) = \begin{bmatrix} c_1(L) & & 0 \\ & \ddots & \\ 0 & & c_K(L) \end{bmatrix}$$

is a diagonal operator with $c_k(L) = 1 + c_{k1}L + \cdots + c_{kq}L^q$, which represents
the individual dynamics of the variables and $B(L) = I_K - B_1 L - \cdots - B_n L^n$
takes care of joint relations. Again the implied restrictions for $\alpha$ can easily be
cast in the form (5.3.1).

In principle, under general conditions, if restrictions are given in the form
(5.3.1), the analysis can proceed analogously to the linear restriction case.
That is, we need to find an estimator $\hat{\gamma}$ of $\gamma$, for instance, by minimizing

$$S(\gamma) = [\mathbf{y} - (Z' \otimes I_K)g(\gamma)]' (I_T \otimes \hat{\Sigma}_u^{-1}) [\mathbf{y} - (Z' \otimes I_K)g(\gamma)],$$

where $\mathbf{y}$, $Z$, and $\hat{\Sigma}_u$ are as defined in Section 5.2. The minimization may
require an iterative algorithm. Such algorithms are described in Section 12.3.2
in the context of estimating VARMA models. Once we have an estimator $\hat{\gamma}$, we
may estimate $\alpha$ as $\hat{\alpha} = g(\hat{\gamma})$. Under similar conditions as for the linear case,
the estimators will be consistent and asymptotically normally distributed,
e.g.,

$$\sqrt{T}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\alpha}}). \tag{5.3.2}$$

The corresponding estimators $\hat{A}_1, \dots, \hat{A}_p$ may be used in computing forecasts,
impulse responses, etc. The asymptotic properties of these quantities then
follow exactly as in the previous sections (see in particular Sections 5.2.6 and
5.2.7).

Another type of "multiplicative" VAR operator has the form

$$A(L) = I_K - B(L)C(L), \tag{5.3.3}$$

where

$$B(L) = B_0 + B_1 L + \cdots + B_q L^q$$

is of dimension $(K \times r)$, that is, the $B_i$'s have dimension $(K \times r)$, and

$$C(L) = C_1 L + \cdots + C_p L^p$$

is of dimension $(r \times K)$, with $r < K$. For $p = 1$, neglecting the intercept terms,
the process becomes

$$y_t = B_0 C_1 y_{t-1} + \cdots + B_q C_1 y_{t-q-1} + u_t,$$

which is sometimes called an index model because $y_t$ is represented in terms
of lagged values of the "index" $C_1 y_t$. In the extreme case where $r = 1$,
$C_1 y_t$ is simply a weighted sum or index of the components of $y_t$, which justifies
the name of the model. Such models have been investigated by Reinsel
(1983) in some detail. Alternatively, if $q = 0$, the process is called a reduced
rank (RR) VAR process, which has been analyzed by Velu, Reinsel & Wichern
(1986), Tso (1981), Ahn & Reinsel (1988), Reinsel (1993, Chapter 6), Reinsel
& Velu (1998) and Anderson (1999, 2002) among others. Models with a reduced
rank structure in the coefficients will be of considerable importance in
Part II, where VAR processes with cointegrated variables are considered. We
will therefore not discuss them here.


5.4  Bayesian  Estimation 

5.4.1  Basic  Terms  and  Notation 

Although the reader is assumed to be familiar with Bayesian estimation, we
summarize some basics here. In the Bayesian approach, it is assumed that
the nonsample or prior information is available in the form of a density.
Denoting the parameters of interest by $\alpha$, let us assume that the prior
information is summarized in the prior probability density function (p.d.f.) $g(\alpha)$.
The sample information is summarized in the sample p.d.f., say $f(\mathbf{y}|\alpha)$, which
is algebraically identical to the likelihood function $\ell(\alpha|\mathbf{y})$. The two types of
information are combined via Bayes' theorem, which states that

$$g(\alpha|\mathbf{y}) = \frac{f(\mathbf{y}|\alpha)\,g(\alpha)}{f(\mathbf{y})},$$

where $f(\mathbf{y})$ denotes the unconditional sample density which, for a given sample,
is just a normalizing constant. In other words, the distribution of $\alpha$,
given the sample information contained in $\mathbf{y}$, can be summarized by $g(\alpha|\mathbf{y})$.
This function is proportional to the likelihood function times the prior density,

$$g(\alpha|\mathbf{y}) \propto f(\mathbf{y}|\alpha)\,g(\alpha) = \ell(\alpha|\mathbf{y})\,g(\alpha). \tag{5.4.1}$$

The conditional density $g(\alpha|\mathbf{y})$ is the posterior p.d.f. It contains all the
information available on the parameter vector $\alpha$. Point estimators of $\alpha$ may be
derived from the posterior distribution. For instance, the mean of that distribution,
called the posterior mean, is often used as a point estimator for $\alpha$. In
the next subsection this general framework is specialized to VAR models.


5.4.2  Normal  Priors  for  the  Parameters  of  a  Gaussian  VAR 
Process 

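Suppose $y_t$ is a zero mean, stable, stationary Gaussian VAR($p$) process of the
form

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t$$

and the prior distribution for $\alpha := \mathrm{vec}(A) = \mathrm{vec}(A_1, \dots, A_p)$ is a multivariate
normal with known mean $\alpha^*$ and covariance matrix $V_\alpha$,

$$g(\alpha) = \left( \frac{1}{2\pi} \right)^{K^2 p/2} |V_\alpha|^{-1/2} \exp\left[ -\frac{1}{2} (\alpha - \alpha^*)' V_\alpha^{-1} (\alpha - \alpha^*) \right]. \tag{5.4.2}$$

Combining this information with the sample information summarized in the
Gaussian likelihood function,

$$
\ell(\alpha|\mathbf{y}) = \left( \frac{1}{2\pi} \right)^{KT/2} |I_T \otimes \Sigma_u|^{-1/2}
\exp\left[ -\frac{1}{2} (\mathbf{y} - (X' \otimes I_K)\alpha)' (I_T \otimes \Sigma_u^{-1}) (\mathbf{y} - (X' \otimes I_K)\alpha) \right]
$$

(see Chapter 3, Section 3.4, for the definitions), gives the posterior density

$$
\begin{aligned}
g(\alpha|\mathbf{y}) &\propto g(\alpha)\,\ell(\alpha|\mathbf{y}) \\
&\propto \exp\Big\{ -\frac{1}{2} \Big[ (V_\alpha^{-1/2}(\alpha - \alpha^*))' (V_\alpha^{-1/2}(\alpha - \alpha^*)) \\
&\qquad + \{ (I_T \otimes \Sigma_u^{-1/2})\mathbf{y} - (X' \otimes \Sigma_u^{-1/2})\alpha \}'
\{ (I_T \otimes \Sigma_u^{-1/2})\mathbf{y} - (X' \otimes \Sigma_u^{-1/2})\alpha \} \Big] \Big\}.
\end{aligned}
\tag{5.4.3}
$$

Here $V_\alpha^{-1/2}$ and $\Sigma_u^{-1/2}$ denote the symmetric square root matrices of $V_\alpha^{-1}$ and
$\Sigma_u^{-1}$, respectively (see Appendix A.9.2). The white noise covariance matrix
$\Sigma_u$ is assumed to be known for the moment. Defining

$$
w := \begin{bmatrix} V_\alpha^{-1/2} \alpha^* \\ (I_T \otimes \Sigma_u^{-1/2})\mathbf{y} \end{bmatrix}
\quad \text{and} \quad
W := \begin{bmatrix} V_\alpha^{-1/2} \\ X' \otimes \Sigma_u^{-1/2} \end{bmatrix},
$$

the exponent in (5.4.3) can be rewritten as

$$
-\frac{1}{2}(w - W\alpha)'(w - W\alpha)
= -\frac{1}{2}\left[ (\alpha - \bar{\alpha})' W'W (\alpha - \bar{\alpha}) + (w - W\bar{\alpha})'(w - W\bar{\alpha}) \right], \tag{5.4.4}
$$

where

$$
\bar{\alpha} := (W'W)^{-1} W'w = [V_\alpha^{-1} + (XX' \otimes \Sigma_u^{-1})]^{-1} [V_\alpha^{-1}\alpha^* + (X \otimes \Sigma_u^{-1})\mathbf{y}]. \tag{5.4.5}
$$

Because the second term on the right-hand side of (5.4.4) does not contain $\alpha$,
it may be absorbed into the constant of proportionality. Hence,

$$g(\alpha|\mathbf{y}) \propto \exp\left[ -\frac{1}{2} (\alpha - \bar{\alpha})' \bar{\Sigma}_\alpha^{-1} (\alpha - \bar{\alpha}) \right],$$

where $\bar{\alpha}$ is given in (5.4.5) and

$$\bar{\Sigma}_\alpha := (W'W)^{-1} = [V_\alpha^{-1} + (XX' \otimes \Sigma_u^{-1})]^{-1}. \tag{5.4.6}$$

Thus, the posterior density is easily recognizable as the density of a multivariate
normal distribution with mean $\bar{\alpha}$ and covariance matrix $\bar{\Sigma}_\alpha$, that is,
the posterior distribution of $\alpha$ is $\mathcal{N}(\bar{\alpha}, \bar{\Sigma}_\alpha)$. This distribution may be used
for inference regarding $\alpha$.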
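The posterior moments in (5.4.5) and (5.4.6) translate directly into a few lines of linear algebra. The sketch below is illustrative only: the function name `posterior_normal` is hypothetical, $Y$ is taken to be the $(K \times T)$ data matrix with $\mathbf{y} = \mathrm{vec}(Y)$, $X$ is the $(Kp \times T)$ regressor matrix, and $\Sigma_u$ is treated as known, as in the text.

```python
import numpy as np

def posterior_normal(Y, X, Sigma_u, alpha_star, V_alpha):
    """Posterior mean (5.4.5) and covariance (5.4.6) under a normal prior.

    Y: (K x T) data, X: (Kp x T) regressors, alpha = vec(A_1, ..., A_p).
    """
    y = Y.flatten(order="F")                      # y = vec(Y), column stacking
    Su_inv = np.linalg.inv(Sigma_u)
    V_inv = np.linalg.inv(V_alpha)
    precision = V_inv + np.kron(X @ X.T, Su_inv)  # V^{-1} + (XX' kron Sigma_u^{-1})
    rhs = V_inv @ alpha_star + np.kron(X, Su_inv) @ y
    cov = np.linalg.inv(precision)
    return cov @ rhs, cov
```

With a very diffuse prior the result approaches the multivariate LS estimator $\mathrm{vec}(YX'(XX')^{-1})$, and with a very tight prior it approaches $\alpha^*$, which is a convenient sanity check for any implementation.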

Sometimes one would like to leave some of the coefficients without any
restrictions because no prior information is available. In the above framework,
this case can be handled by setting the corresponding prior variance to infinity.
Unfortunately, such a choice is inconvenient here because algebraic operations
have to be performed with the elements of $V_\alpha$ in order to compute $\bar{\alpha}$ and $\bar{\Sigma}_\alpha$.
Therefore, in such cases it is preferable to write the prior information in the
form

$$C\alpha = c + e \quad \text{with} \quad e \sim \mathcal{N}(0, I). \tag{5.4.7}$$

Here $C$ is a fixed matrix and $c$ is a fixed vector. If $C$ is a $(K^2 p \times K^2 p)$
nonsingular matrix,

$$\alpha \sim \mathcal{N}(C^{-1}c,\ C^{-1}C'^{-1}).$$

That is, the prior information is given in the form of a multivariate normal
distribution with mean $C^{-1}c$ and covariance matrix $(C'C)^{-1}$. From (5.4.5),
under Gaussian assumptions, the resulting posterior mean is

$$\bar{\alpha} = [C'C + (XX' \otimes \Sigma_u^{-1})]^{-1} [C'c + (X \otimes \Sigma_u^{-1})\mathbf{y}]. \tag{5.4.8}$$

A practical advantage of this representation of the posterior mean is that it
does not require the inversion of $V_\alpha$. Moreover, this form can also be used
if no prior information is available for some of the coefficients. For instance,
if no prior information on the first coefficient is available, we may simply
eliminate one row from $C$ and put zeros in the first column. Although the
prior information cannot be represented in the form of a proper multivariate
normal distribution in this case, the estimator $\bar{\alpha}$ in (5.4.8) can still be used.

In order to make these concepts useful, the prior mean $\alpha^*$ and covariance
matrix $V_\alpha$ or $C$ and $c$ must be specified. In the next subsection possible choices
are considered.


5.4.3  The  Minnesota  or  Litterman  Prior 


In  Litterman  (1986)  and  Doan,  Litterman  &  Sims  (1984),  a  specific  prior,  often 
referred  to  as  Minnesota  prior  or  Litterman  prior,  for  the  parameters  of  a  VAR 
model  is  described.  A  similar  prior  will  be  considered  here  as  an  example.  The 
so-called  Minnesota  prior  was  suggested  for  certain  nonstationary  processes. 
We  will  adapt  it  for  the  stationary  case  because  we  are  still  dealing  with 
stationary,  stable  processes.  The  nonstationary  version  of  the  Minnesota  prior 
will  be  presented  in  Chapter  7. 

If the intertemporal dependence of the variables is believed to be weak,
one way to describe this is to set the prior mean of the VAR coefficients to
zero with nonzero prior variances. In other words, $\alpha^* = 0$ and $V_\alpha \neq 0$. With
this choice of $\alpha^*$ the posterior mean in (5.4.5) reduces to

$$\bar{\alpha} = [V_\alpha^{-1} + (XX' \otimes \Sigma_u^{-1})]^{-1} (X \otimes \Sigma_u^{-1})\mathbf{y}. \tag{5.4.9}$$

This estimator for $\alpha$ looks like the multivariate LS estimator except for the
inverse covariance matrix $V_\alpha^{-1}$.

In the spirit of Litterman (1986), the prior covariance matrix $V_\alpha$ may be
specified as a diagonal matrix with diagonal elements

$$
v_{ij,l} = \begin{cases}
(\lambda / l)^2 & \text{if } i = j, \\
(\lambda \theta \sigma_i / l \sigma_j)^2 & \text{if } i \neq j,
\end{cases}
\tag{5.4.10}
$$

where $v_{ij,l}$ is the prior variance of $\alpha_{ij,l}$, $\lambda$ is the prior standard deviation of
the coefficients $\alpha_{kk,1}$, $k = 1, \dots, K$, $0 < \theta < 1$, and $\sigma_i^2$ is the $i$-th diagonal
element of $\Sigma_u$. For each equation, $\lambda$ controls how tightly the coefficient of
the first lag of the dependent variable is believed to be concentrated around
zero. For instance, in the $k$-th equation of the system it is the prior standard
deviation of $\alpha_{kk,1}$. In practice, different values of $\lambda$ are sometimes tried. Using
different $\lambda$'s in different equations may also be considered.

Because it is believed that coefficients of high order lags are likely to be
close to zero, the prior variance decreases with increasing lag length $l$.
Furthermore, it is believed that most of the variation in each of the variables
is accounted for by own lags. Therefore coefficients of variables other than
the dependent variable are assigned a smaller variance in relative terms by
choosing $\theta$ between 0 and 1, for instance, $\theta = .2$. The ratio $\sigma_i^2/\sigma_j^2$ is included
to take care of the differences in the variability of the different variables. Here
the residual variances are preferred over the $y_{kt}$ variances because it is assumed
that the response of one variable to another is largely determined by
the unexpected movements reflected in the residual variance. Finally, the assumption
of a diagonal $V_\alpha$ matrix means that independent prior distributions
of the different coefficients are specified. This specification mainly reflects our
inability to model dependencies between the coefficients.

As an example consider a bivariate VAR(2) system consisting of the two
equations

$$
\begin{aligned}
y_{1t} &= \underset{(\lambda)}{\alpha_{11,1}}\, y_{1,t-1} + \underset{(\lambda\theta\sigma_1/\sigma_2)}{\alpha_{12,1}}\, y_{2,t-1}
+ \underset{(\lambda/2)}{\alpha_{11,2}}\, y_{1,t-2} + \underset{(\lambda\theta\sigma_1/2\sigma_2)}{\alpha_{12,2}}\, y_{2,t-2} + u_{1t}, \\
y_{2t} &= \underset{(\lambda\theta\sigma_2/\sigma_1)}{\alpha_{21,1}}\, y_{1,t-1} + \underset{(\lambda)}{\alpha_{22,1}}\, y_{2,t-1}
+ \underset{(\lambda\theta\sigma_2/2\sigma_1)}{\alpha_{21,2}}\, y_{1,t-2} + \underset{(\lambda/2)}{\alpha_{22,2}}\, y_{2,t-2} + u_{2t},
\end{aligned}
\tag{5.4.11}
$$

where the prior standard deviations are given in parentheses. The prior covariance
matrix of the eight coefficients of this system, ordered as in
$\alpha = \mathrm{vec}(A_1, A_2)$, is

$$
V_\alpha = \mathrm{diag}\left( \lambda^2,\ \left( \frac{\lambda\theta\sigma_2}{\sigma_1} \right)^2,\ \left( \frac{\lambda\theta\sigma_1}{\sigma_2} \right)^2,\ \lambda^2,\
\left( \frac{\lambda}{2} \right)^2,\ \left( \frac{\lambda\theta\sigma_2}{2\sigma_1} \right)^2,\ \left( \frac{\lambda\theta\sigma_1}{2\sigma_2} \right)^2,\ \left( \frac{\lambda}{2} \right)^2 \right).
$$

In terms of (5.4.7), this prior may be specified by choosing $c = 0$ and $C$ an
$(8 \times 8)$ diagonal matrix with the square roots of the reciprocals of the diagonal
elements of $V_\alpha$ on the main diagonal.
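The prior variances in (5.4.10) can be generated mechanically for any $K$ and $p$. The following sketch assumes a hypothetical helper `minnesota_prior_variances` that returns the diagonal of $V_\alpha$ in the ordering of $\alpha = \mathrm{vec}(A_1, \dots, A_p)$, given the residual standard deviations $\sigma_1, \dots, \sigma_K$.

```python
import numpy as np

def minnesota_prior_variances(sigma, p, lam, theta):
    """Diagonal of V_alpha from (5.4.10), ordered like vec(A_1, ..., A_p).

    sigma: residual standard deviations (length K); lam, theta: tightness
    parameters lambda and theta of the text.
    """
    sigma = np.asarray(sigma, dtype=float)
    K = sigma.size
    blocks = []
    for l in range(1, p + 1):
        v = np.empty((K, K))
        for i in range(K):          # equation index
            for j in range(K):      # variable index
                if i == j:
                    v[i, j] = (lam / l) ** 2
                else:
                    v[i, j] = (lam * theta * sigma[i] / (l * sigma[j])) ** 2
        blocks.append(v.flatten(order="F"))   # vec stacks the columns of A_l
    return np.concatenate(blocks)
```

For the bivariate VAR(2) example, `np.diag(1 / np.sqrt(minnesota_prior_variances(sigma, 2, lam, theta)))` would give the diagonal matrix $C$ described above, with $c = 0$.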

5.4.4  Practical  Considerations 

In specifying the Minnesota priors, even if $\lambda$ and $\theta$ are chosen appropriately,
there remain some practical problems. The first results from the fact that
$\Sigma_u$ is usually unknown. In a strict Bayesian approach, a prior p.d.f. for the
elements of $\Sigma_u$ would be chosen. However, that would lead to a more difficult
posterior distribution for $\alpha$. Therefore, a more practical approach is to replace
the $\sigma_i$ by the square roots of the diagonal elements of the LS or ML estimator
of $\Sigma_u$, e.g.,

$$\tilde{\Sigma}_u = Y (I_T - X'(XX')^{-1}X) Y'/T.$$

A second problem is the computational expense that may result from the
inversion of the matrix $V_\alpha^{-1} + (XX' \otimes \Sigma_u^{-1})$ or $C'C + (XX' \otimes \Sigma_u^{-1})$ in the
posterior mean $\bar{\alpha}$ which is usually used as an estimator for $\alpha$. This matrix has
dimension $(K^2 p \times K^2 p)$. Because in a Bayesian analysis sometimes one may
want to choose a large order $p$ and put tight zero priors on the coefficients
of large lags rather than make them zero with probability 1, like in an order
selection approach, the dimension of the matrix to be inverted in computing
$\bar{\alpha}$ may be quite substantial, although this may not be a concern with modern
computing technology. Still, Bayesian estimation is sometimes applied to
each of the $K$ equations of the system individually. For instance, for the $k$-th
equation,

$$\bar{a}_k := [V_k^{-1} + \sigma_k^{-2} XX']^{-1} [V_k^{-1} a_k^* + \sigma_k^{-2} X y_{(k)}] \tag{5.4.12}$$

may be used as an estimator of the parameters $a_k$ (the transpose of the $k$-th
row of $A = [A_1, \dots, A_p]$). Here $a_k^*$ is the prior mean and $V_k$ is the prior
covariance matrix of $a_k$ and $y_{(k)}'$ is the $k$-th row of $Y$. Using (5.4.12) instead
of (5.4.5) reduces the computational expense a bit.
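A single-equation estimator of the form (5.4.12) might be coded as follows. This is a sketch under the assumption that the residual variance $\sigma_k^2$ has already been replaced by an estimate; the function name `equation_posterior_mean` is hypothetical.

```python
import numpy as np

def equation_posterior_mean(y_k, X, sigma_k2, a_star, V_k):
    """Posterior mean (5.4.12) for the k-th equation.

    y_k: length-T vector (k-th row of Y), X: (Kp x T) regressor matrix,
    sigma_k2: residual variance of equation k, a_star / V_k: prior mean
    and prior covariance matrix of a_k.
    """
    V_inv = np.linalg.inv(V_k)
    precision = V_inv + X @ X.T / sigma_k2
    rhs = V_inv @ a_star + X @ y_k / sigma_k2
    # Solve the (Kp x Kp) system instead of forming an explicit inverse
    return np.linalg.solve(precision, rhs)
```

Only a $(Kp \times Kp)$ system has to be solved per equation, compared with the $(K^2 p \times K^2 p)$ system behind (5.4.5), which is the computational saving mentioned in the text.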

A further problem is related to the zero mean assumption made in the
foregoing for the process $y_t$. In practice, one may simply subtract the sample
mean  from  each  variable  and  then  perform  a  Bayesian  analysis  for  the  mean- 
adjusted  data.  This  amounts  to  assuming  that  no  prior  information  exists 
for  the  mean  terms.  Alternatively,  intercept  terms  may  be  included  in  the 
analysis.  If  the  prior  information  is  specified  in  terms  of  (5.4.7),  it  is  easy  to 
leave  the  intercept  terms  unrestricted,  if  desired. 

5.4.5  An  Example 

To illustrate the Bayesian approach, we have computed estimates $\bar{a}_k$ as in
(5.4.12) for the investment/income/consumption example data with different
values of $\lambda$ and $\theta$. Again we use first differences of logarithms of the data
for the years 1960-1978. In Table 5.3, we give estimates for the investment
equation of a VAR(2) model. In a Bayesian analysis, one would usually choose
a larger VAR order. For illustrative purposes, the VAR(2) model is helpful,
however.


Table 5.3. Bayesian estimates of the investment equation from the investment/income/consumption
system

λ      θ      ν_1      α_11,1   α_12,1   α_13,1   α_11,2   α_12,2   α_13,2
∞      1      -.017    -.320    .146     .961     -.161    .115     .934
1      .99    -.015    -.309    .159     .921     -.147    .135     .854
.1     .99    .008     -.096    .150     .297     -.011    .062     .100
.01    .99    .018     -.001    .003     .005     -.000    .000     .001
1      .50    -.013    -.301    .194     .847     -.141    .165     .718
1      .10    .009     -.245    .190     .369     -.099    .074     .137
1      .01    .023     -.208    .004     .007     -.078    .001     .002


In the investment equation, the parameter $\lambda$ controls the overall prior variance
of all VAR coefficients while $\theta$ controls the tightness of the variances of
the coefficients of lagged income and consumption. Roughly speaking, $\theta$ specifies
the fraction of the prior standard deviation $\lambda$ attached to the coefficients
of lagged income and consumption. Thus, a value of $\theta$ close to one means that
all coefficients of lag 1 have about the same prior variance except for a scaling
factor that takes care of the different variability of different variables. Note
that the intercept terms are not restricted (prior variance set to $\infty$).

We assume a prior mean of zero for all coefficients, $a_k^* = 0$, and thus shrink
towards zero by tightening the prior standard deviation $\lambda$. The effect is clearly
reflected in Table 5.3. For $\theta = .99$ and $\lambda = 1$ we get coefficient estimates which
are quite similar to unrestricted LS estimates ($\lambda = \infty$, $\theta = 1$). Decreasing $\lambda$
to zero tightens the prior variance and shrinks all VAR coefficients to zero.
For $\lambda = .01$, they are quite close to zero already. On the other hand, moving
the variance fraction $\theta$ towards zero shrinks the consumption and income
coefficients ($\alpha_{12,i}$, $\alpha_{13,i}$) towards zero. In Table 5.3, for $\lambda = 1$ and $\theta = .01$
they are seen to be almost zero. This, of course, has some impact on the
investment coefficients ($\alpha_{11,i}$) too.

5.4.6 Classical versus Bayesian Interpretation of $\bar{\alpha}$ in Forecasting
and Structural Analysis

If the coefficients of a VAR process are estimated by a Bayesian procedure,
the estimated process may be used for prediction and economic analysis, as
described in the previous sections. Again one question of interest concerns
the statistical properties of the resulting forecasts and impulse responses. It
is possible to interpret $\bar{\alpha}$ in (5.4.5) or (5.4.8) as an estimator in the classical
sense and to answer this question in terms of asymptotic theory, as in the previous
sections. In the classical context, $\bar{\alpha}$ may be interpreted as a shrinkage
estimator or under the heading of estimation with stochastic restrictions (e.g.,
Judge, Griffiths, Hill, Lütkepohl & Lee (1985, Chapter 3)). In regression models
with nonstochastic regressors, such estimators, under suitable conditions,
have smaller mean squared errors than ML estimators in small samples. In
the present framework, the small sample properties are unknown in general.

To  derive  asymptotic  properties,  let  us  consider  the  representation  (5.4.8). 
It  is  easily  seen  that,  under  our  standard  conditions, 

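$$
\begin{aligned}
\operatorname{plim} \bar{\alpha}
&= \operatorname{plim}\left[\frac{C'C}{T} + \frac{XX'}{T} \otimes \Sigma_u^{-1}\right]^{-1}
   \times \operatorname{plim}\left[\frac{C'c}{T} + \operatorname{vec}\left(\Sigma_u^{-1}\frac{YX'}{T}\right)\right] \\
&= \operatorname{plim}\left[\frac{XX'}{T} \otimes \Sigma_u^{-1}\right]^{-1}
   \operatorname{plim}\operatorname{vec}\left(\Sigma_u^{-1}\frac{YX'}{T}\right) = \alpha.
\end{aligned}
$$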

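Here $\operatorname{plim} C'C/T = \lim C'C/T = 0$ and $\operatorname{plim} C'c/T = 0$
have been used. Moreover, viewing $\bar{\alpha}$ as an estimator in the classical
sense, it has the same asymptotic distribution as the unconstrained multivariate
LS estimator,

$$\hat{\alpha} = \operatorname{vec}(YX'(XX')^{-1}),$$

because

$$
\begin{aligned}
\sqrt{T}(\bar{\alpha} - \hat{\alpha})
&= \left[\frac{C'C}{T} + \frac{XX'}{T} \otimes \Sigma_u^{-1}\right]^{-1}
   \left[\frac{C'c}{\sqrt{T}} + \frac{1}{\sqrt{T}}\operatorname{vec}(\Sigma_u^{-1}YX')\right] \\
&\quad - \left[\frac{XX'}{T} \otimes \Sigma_u^{-1}\right]^{-1}
   \frac{1}{\sqrt{T}}\operatorname{vec}(\Sigma_u^{-1}YX')
   \xrightarrow{p} 0.
\end{aligned}
$$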


Thus, $\bar{\alpha}$ and $\hat{\alpha}$ have the same asymptotic distribution by Proposition C.2(2)
of  Appendix  C.  This  result  is  intuitively  appealing  because  it  shows  that  the 
contribution  of  the  prior  information  becomes  negligible  when  the  sample  size 
approaches  infinity  and  the  sample  information  becomes  exhaustive.  Yet  the 
result  is  not  very  helpful  when  a  small  sample  is  given  in  a  practical  situation. 

Consequently, it may be preferable to base the analysis on the posterior
distribution of $\alpha$. In general, it will be difficult to derive the
distribution of, say, the impulse responses from the posterior distribution of
$\alpha$ analytically. In that case, one may obtain, for instance, confidence
intervals of these quantities from a simulation. That is, a large number of
samples is drawn from the posterior distribution of $\alpha$ and the corresponding
impulse response coefficients are computed. The required percentage points are
then estimated from the empirical distributions of the estimated impulse responses
(see Appendix D).
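
The simulation just described can be sketched in Python as follows. The sketch assumes a VAR(1) without intercept, so that the forecast error impulse responses are simply $\Phi_i = A^i$, and a normal posterior with given mean and covariance; the function names, the posterior covariance `0.01 * I` and the numerical values are hypothetical choices for illustration only.

```python
import numpy as np

def impulse_responses(A, horizon):
    """Forecast error impulse responses Phi_i = A^i, i = 0..horizon, of a VAR(1)."""
    Phi = [np.eye(A.shape[0])]
    for _ in range(horizon):
        Phi.append(Phi[-1] @ A)
    return np.stack(Phi)

def posterior_ir_bands(alpha_bar, Sigma_alpha, K, horizon,
                       n_draws=2000, level=0.95, seed=0):
    """Percentile bands of impulse responses from draws of a normal posterior."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(alpha_bar, Sigma_alpha, size=n_draws)
    # Each draw is vec(A) (columns stacked), so reshape column-wise
    irs = np.stack([impulse_responses(d.reshape(K, K, order="F"), horizon)
                    for d in draws])
    lo, hi = np.quantile(irs, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi

# Hypothetical posterior mean and covariance of alpha = vec(A)
A_mean = np.array([[0.5, 0.1], [0.2, 0.3]])
alpha_bar = A_mean.flatten(order="F")
lo, hi = posterior_ir_bands(alpha_bar, 0.01 * np.eye(4), K=2, horizon=8)
print(np.round(lo[1], 2))
print(np.round(hi[1], 2))
```

The arrays `lo` and `hi` hold, for each horizon and each pair of variables, the lower and upper percentage points of the simulated impulse responses.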
5.5  Exercises 

In  the  following  exercises,  the  notation  of  the  previous  sections  of  this  chapter 
is  used. 

5.5.1  Algebraic  Exercises 

Problem  5.1 

Show that the estimator of $\gamma$ given in (5.2.14) minimizes

$$(z - (Z' \otimes I_K)R\gamma)'(z - (Z' \otimes I_K)R\gamma)$$

with respect to $\gamma$.

Problem 5.2

Prove that the estimator of $\beta$ given in (5.2.16) minimizes

$$[y - (Z' \otimes I_K)\beta]'(I_T \otimes \Sigma_u^{-1})[y - (Z' \otimes I_K)\beta]$$

subject to the restriction $C\beta = c$, where $C$ is $(N \times K(Kp + 1))$ of rank $N$
and $c$ is $(N \times 1)$. (Hint: Specify the appropriate Lagrange function and find
its stationary point as described in Appendix A.14.)

Problem  5.3 

Show that the estimator of $\gamma$ given in (5.2.17) is the ML estimator of
$\gamma$. (Hint: Use the partial derivatives from Section 3.4.)

Problem 5.4

Prove  Proposition  5.6. 

Problem  5.5 

Derive the asymptotic distribution of the EGLS estimator of the parameter
vector $\alpha := \operatorname{vec}(A_1, \ldots, A_p)$, based on mean-adjusted
data, subject to restrictions $\alpha = R\gamma + r$, where $R$, $\gamma$, and $r$
have suitable dimensions.

Problem  5.6 

Consider  the  recursive  system  of  Section  5.2.5, 

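$$y_t = \eta + A_0 y_t + A_1 y_{t-1} + \cdots + A_p y_{t-p} + \varepsilon_t,$$

where $\varepsilon_t$ has a diagonal covariance matrix $\Sigma_\varepsilon$. Show
that $\sum_t \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t$ and
$\sum_t \varepsilon_t'\varepsilon_t$ assume their minima with respect to the
unknown parameters for the same values of $\eta, A_0, \ldots, A_p$.

(Hint: Note that

$$\sum_{t=1}^{T} \varepsilon_t'\Sigma_\varepsilon^{-1}\varepsilon_t = \sum_{k=1}^{K}\sum_{t=1}^{T} \varepsilon_{kt}^2/\sigma_k^2$$

and consider the partial derivatives with respect to the coefficients of the
$k$-th equation. Here $\varepsilon_{kt}$ is the $k$-th element of $\varepsilon_t$
and $\sigma_k^2$ is the $k$-th diagonal element of $\Sigma_\varepsilon$.)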
5.5.2  Numerical  Problems 

The following problems require the use of a computer. They are based on the
bivariate time series $y_t = (y_{1t}, y_{2t})'$ of first differences of the U.S.
investment data provided in File E2.

Problem 5.7

Fit a VAR(2) model to the first differences of the data from File E2 subject to
the restrictions $\alpha_{12,i} = 0$, $i = 1, 2$. Determine the EGLS parameter
estimates and estimates of their asymptotic standard errors. Perform an F-test to
check the restrictions.

Problem  5.8 

Based on the result of the previous problem, perform an impulse response
analysis for $y_1$ and $y_2$.

Problem  5.9 

Use  a  maximum  order  of  4  and  the  AIC  criterion  to  determine  an  optimal 
subset  VAR  model  for  yt  with  the  top-down  strategy  described  in  Section 
5.2.8.  Repeat  the  exercise  with  the  HQ  criterion.  Compare  the  two  models 
and  interpret. 

Problem  5.10 

Based on the results of Problem 5.9, perform an impulse response analysis for
$y_1$ and $y_2$ and compare the result to those of Problem 5.8.

Problem  5.11 

Use the Minnesota prior with $\lambda = 1$ and $\theta = .2$ and compute the
posterior mean of the coefficients of a VAR(4) model for the mean-adjusted $y_t$.
Compare this estimator to the unconstrained multivariate LS estimator of a VAR(4)
model for the mean-adjusted data. Repeat the exercise with a VAR(4) model
that contains intercept terms.
Cointegrated  Processes 

In  Part  I,  stationary,  stable  VAR  processes  have  been  considered.  Recall  that 
a  process  is  stationary  if  it  has  time  invariant  first  and  second  moments.  This 
property  implies  that  there  are  no  trends  (trending  means)  or  shifts  in  the 
mean  or  in  the  covariances.  Moreover,  there  are  no  deterministic  seasonal 
patterns.  In  this  part,  nonstationary  processes  of  a  very  specific  type  will 
be  considered.  In  particular,  the  processes  will  be  allowed  to  have  stochastic 
trends.  They  are  then  called  integrated.  If  some  of  the  variables  move  together 
in  the  long-run  although  they  have  stochastic  trends,  they  are  driven  by  a 
common  stochastic  trend  and  they  are  called  cointegrated.  VAR  processes  with 
integrated  and  cointegrated  variables  are  analyzed  in  this  part.  In  Chapter  6, 
some  important  theoretical  properties  of  cointegrated  processes  are  discussed 
and  it  is  shown  that  they  can  be  conveniently  summarized  in  a  vector  error 
correction  model  (VECM).  Estimation  of  such  models  is  treated  in  Chapter 
7.  Specification  of  VECMs  and  model  checking  are  considered  in  Chapter  8. 


6  Vector  Error  Correction  Models 


As defined in Chapter 2, a process is stationary if it has time invariant first
and second moments. In particular, it does not have trends or changing variances.
A VAR process has this property if the determinantal polynomial of its VAR
operator has all its roots outside the complex unit circle. Clearly, stationary
processes cannot capture some main features of many economic time series. For
example, trends (trending means) are quite common in practice. For instance, the
original investment, income, and consumption data used in many previous examples
have trends (see Figure 3.1). Thus, if interest centers on analyzing the original
variables (or their logarithms) rather than the rates of change, it is necessary
to have models that accommodate the nonstationary features of the data. It turns
out that a VAR process can generate stochastic and deterministic trends if the
determinantal polynomial of the VAR operator has roots on the unit circle. In
fact, it is even sufficient to allow for unit roots (roots for $z = 1$) to obtain
a trending behavior of the variables. We will consider this case in some detail in
this chapter. In the next section, the effect of unit roots in the AR operator of
a univariate process will be analyzed. Variables generated by such processes are
called integrated variables and the underlying generating processes are integrated
processes. Vector processes with unit roots are considered in Section 6.2. In
these processes, some of the variables can have common trends so that they move
together to some extent. They are then called cointegrated. This feature is
considered in detail in Section 6.3 and it is shown that vector error correction
models (VECMs) offer a convenient way to parameterize and specify them. In
Section 6.3, the processes are assumed to be purely stochastic and do not have
deterministic terms. How to incorporate these terms is the subject of Section 6.4.
Once we have a suitable model setup, it can be used for forecasting, causality
analysis, and impulse response analysis. These issues are treated in Sections
6.5-6.7.
6.1  Integrated  Processes 

Recall that a VAR($p$) process,

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \tag{6.1.1}$$

is stable if the polynomial defined by

$$\det(I_K - A_1 z - \cdots - A_p z^p)$$

has no roots in and on the complex unit circle. For a univariate AR(1) process,
$y_t = \alpha y_{t-1} + u_t$, this property means that

$$1 - \alpha z \neq 0 \quad \text{for } |z| \leq 1$$

or, equivalently, $|\alpha| < 1$.

Consider the borderline case, where $\alpha = 1$. The resulting process
$y_t = y_{t-1} + u_t$ is called a random walk. Starting the process at $t = 0$
with some fixed $y_0$, it is easy to see by successive substitution for lagged
$y_t$'s, that

$$y_t = y_{t-1} + u_t = y_{t-2} + u_{t-1} + u_t = \cdots = y_0 + \sum_{i=1}^{t} u_i. \tag{6.1.2}$$

Thus, $y_t$ consists of the sum of all disturbances or innovations of the previous
periods so that each disturbance has a lasting impact on the process. If $u_t$ is
white noise with variance $\sigma_u^2$,

$$E(y_t) = y_0$$

and

$$\operatorname{Var}(y_t) = t\operatorname{Var}(u_t) = t\sigma_u^2.$$


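Hence, the variance of a random walk tends to infinity. Furthermore, the
correlation

$$
\operatorname{Corr}(y_t, y_{t+h})
= \frac{E\left[\left(\sum_{i=1}^{t} u_i\right)\left(\sum_{i=1}^{t+h} u_i\right)\right]}
       {[t\sigma_u^2 (t+h)\sigma_u^2]^{1/2}}
= \frac{t}{(t^2 + th)^{1/2}} \xrightarrow[t \to \infty]{} 1
$$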


for  any  integer  h.  This  latter  property  of  a  random  walk  means  that  yt  and 
ys  are  strongly  correlated  even  if  they  are  far  apart  in  time.  It  can  also  be 
shown  that  the  expected  time  between  two  crossings  of  zero  is  infinite.  These 
properties  are  often  reflected  in  trending  behavior.  Examples  are  depicted  in 
Figure  6.1.  This  kind  of  trend  is,  of  course,  not  a  deterministic  one  but  a 
stochastic  trend. 
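
The variance and correlation formulas for the random walk can be checked by a small Monte Carlo experiment. The following Python sketch assumes Gaussian white noise, $y_0 = 0$, and arbitrary illustrative values for the number of replications, $t$ and $h$; none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n_paths, T, sigma_u = 20000, 200, 1.0

# Each row is one realization; with y_0 = 0, column t-1 holds y_t = u_1 + ... + u_t
u = rng.normal(0.0, sigma_u, size=(n_paths, T))
y = np.cumsum(u, axis=1)

t, h = 100, 50
var_t = y[:, t - 1].var()                                 # estimate of Var(y_t)
corr = np.corrcoef(y[:, t - 1], y[:, t + h - 1])[0, 1]    # estimate of Corr(y_t, y_{t+h})
corr_theory = t / np.sqrt(t**2 + t * h)

print("Var(y_t):", round(var_t, 2), "theory:", t * sigma_u**2)
print("Corr(y_t, y_{t+h}):", round(corr, 3), "theory:", round(corr_theory, 3))
```

Increasing `t` for fixed `h` moves the theoretical correlation towards one, in line with the limit stated above.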
If the process has a nonzero constant term $\nu$, $y_t = \nu + y_{t-1} + u_t$ is
called a random walk with drift and it has a deterministic linear trend in the
mean. To see this property, suppose again that the process is started at $t = 0$
with a fixed $y_0$. Then

$$y_t = y_0 + t\nu + \sum_{i=1}^{t} u_i$$

and $E(y_t) = y_0 + t\nu$. An example of a time series generated by a random walk
with drift is shown in Figure 6.2.


Fig.  6.2.  An  artificially  generated  random  walk  with  drift. 


The previous discussion suggests that starting unstable processes at some
finite time $t_0$ is useful to obtain processes with finite moments. On the other
hand, if an AR process starts at some finite time, it is strictly speaking not
necessarily stationary, even if it is stable. To see this property, let
$y_t = \nu + \alpha y_{t-1} + u_t$ be a univariate stable AR(1) process with
$|\alpha| < 1$. Starting with a random variable $y_0$ at $t = 0$ gives

$$y_t = \nu\sum_{i=0}^{t-1}\alpha^i + \alpha^t y_0 + \sum_{i=0}^{t-1}\alpha^i u_{t-i}.$$

Hence,

$$E(y_t) = \nu\sum_{i=0}^{t-1}\alpha^i + \alpha^t E(y_0)$$

is generally not time invariant if $\alpha$ and $\nu \neq 0$. A similar result is
obtained for the second moments,

$$\operatorname{Var}(y_t) = \alpha^{2t}\operatorname{Var}(y_0) + \sigma_u^2\sum_{i=0}^{t-1}\alpha^{2i}.$$

However, the first and second moments approach limit values as $t \to \infty$ and
one might call such a process asymptotically stationary. To simplify matters,
the term “asymptotically” is sometimes dropped and such processes are then
simply called stationary. Moreover, if we consider purely stochastic processes
without deterministic terms ($\nu = 0$), the initial variable can be chosen such
that $y_t$ is stationary if the process is stable. In particular, if we choose

$$y_0 = \sum_{i=0}^{\infty}\alpha^i u_{-i}$$

we get, for $\nu = 0$,

$$y_t = \alpha^t\sum_{i=0}^{\infty}\alpha^i u_{-i} + \sum_{i=0}^{t-1}\alpha^i u_{t-i} = \sum_{i=0}^{\infty}\alpha^i u_{t-i}, \quad t = 1, 2, \ldots,$$

and, hence, for $t = 1, 2, \ldots$,

$$E(y_t) = 0,$$

$$\operatorname{Var}(y_t) = \sigma_u^2/(1 - \alpha^2),$$

and  also  the  autocovariances  are  time  invariant.  Thus,  for  a  stable  process 
we  may  in  fact  choose  the  initial  variable  such  that  yt  is  stationary  even  if 
the  process  is  started  in  some  given  period.  This  result  can  also  be  used  as  a 
justification  for  simply  calling  stable  processes  stationary  in  this  situation.  We 
may  implicitly  assume  that  the  starting  value  is  chosen  to  justify  the  termi¬ 
nology.  For  our  purposes,  this  point  is  of  limited  importance  because  in  later 
chapters  we  will  be  interested  in  the  parameters  of  the  processes  considered 
and  possibly  in  their  asymptotic  moments.  Without  further  warning,  nonsta¬ 
tionary,  unstable  processes  will  be  assumed  to  begin  at  some  given  finite  time 
period. 
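
The time dependence of the moments of a stable AR(1) process started at a finite time can be made concrete numerically. The Python sketch below evaluates the mean and variance formulas given above for a fixed starting value $y_0 = 0$ (so that $E(y_0) = 0$ and $\operatorname{Var}(y_0) = 0$); the parameter values are arbitrary assumptions chosen for illustration.

```python
# Illustrative parameter values (assumptions, not taken from the text)
nu, alpha, sigma2_u = 1.0, 0.8, 1.0
mean_y0, var_y0 = 0.0, 0.0   # fixed starting value y_0 = 0

def moments(t):
    """E(y_t) and Var(y_t) of the AR(1) process started at time 0."""
    geo = sum(alpha**i for i in range(t))          # sum_{i=0}^{t-1} alpha^i
    geo2 = sum(alpha**(2 * i) for i in range(t))   # sum_{i=0}^{t-1} alpha^{2i}
    mean = nu * geo + alpha**t * mean_y0
    var = alpha**(2 * t) * var_y0 + sigma2_u * geo2
    return mean, var

limit_mean = nu / (1 - alpha)               # limit of E(y_t) as t grows
limit_var = sigma2_u / (1 - alpha**2)       # limit of Var(y_t) as t grows

for t in (1, 2, 5, 20, 200):
    m, v = moments(t)
    print(t, round(m, 4), round(v, 4))
print("limits:", round(limit_mean, 4), round(limit_var, 4))
```

The moments change with $t$ for small $t$ but settle at their limit values quickly, which is the sense in which such a process is asymptotically stationary.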

A behavior similar to that of a random walk is also observed for higher
order AR processes such as

$$y_t = \nu + \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + u_t$$

if $1 - \alpha_1 z - \cdots - \alpha_p z^p$ has a root for $z = 1$. Note that

$$1 - \alpha_1 z - \cdots - \alpha_p z^p = (1 - \lambda_1 z)\cdots(1 - \lambda_p z),$$

where $\lambda_1, \ldots, \lambda_p$ are the reciprocals of the roots of the
polynomial. If the process has just one unit root (a root equal to 1) and all
other roots are outside the complex unit circle, its behavior is similar to that
of a random walk, that is, its variances increase linearly, the correlation
between variables $h$ periods apart tends to 1 and the process has a linear trend
in mean if $\nu \neq 0$.
In  case  one  of  the  roots  is  strictly  inside  the  unit  circle,  the  process  becomes 
explosive,  that  is,  its  variances  go  to  infinity  at  an  exponential  rate.  Many 
researchers  feel  that  such  processes  are  unrealistic  models  for  most  economic 
data.  Although  processes  with  roots  on  the  unit  circle  other  than  one  are 
often  useful,  we  shall  concentrate  on  the  case  of  unit  roots  and  all  other  roots 
outside  the  unit  circle.  This  situation  is  of  considerable  practical  interest. 

Univariate processes with $d$ unit roots ($d$ roots equal to 1) in their AR
operators are called integrated of order $d$ ($I(d)$). If there is just one unit
root, i.e., the process is $I(1)$, it is quite easy to see how a stable and
possibly stationary process can be obtained: simply by taking first differences,
$\Delta y_t := (1 - L)y_t = y_t - y_{t-1}$, of the original process. More
generally, if the process is $I(d)$ it can be made stable by differencing $d$
times, that is, $\Delta^d y_t = (1 - L)^d y_t$ is stable and, again, initial
values can be chosen such that it is stationary. In the following, it will often
be convenient to extend this terminology also to stable, stationary processes and
to call them $I(0)$.
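
As an illustration of how unit roots in an AR operator can be detected and removed, the following Python sketch uses the hypothetical AR(2) process $y_t = 1.5y_{t-1} - 0.5y_{t-2} + u_t$, whose AR polynomial is $1 - 1.5z + 0.5z^2 = (1 - z)(1 - 0.5z)$. The example process and the helper name `ar_roots` are assumptions made for this sketch.

```python
import numpy as np

def ar_roots(alphas):
    """Roots of the AR polynomial 1 - alpha_1 z - ... - alpha_p z^p."""
    # np.roots expects coefficients ordered from the highest power to the constant
    coeffs = [-a for a in alphas[::-1]] + [1.0]
    return np.roots(coeffs)

alphas = [1.5, -0.5]                       # hypothetical AR(2) coefficients
roots = ar_roots(alphas)
d = int(np.sum(np.isclose(roots, 1.0)))    # number of unit roots

# Dividing the AR polynomial by (1 - z) gives the AR operator of the differenced process
ar_poly = [0.5, -1.5, 1.0]                 # 0.5 z^2 - 1.5 z + 1, highest power first
quotient, remainder = np.polydiv(ar_poly, [-1.0, 1.0])

print("roots:", np.sort(roots.real), "unit roots:", d)
print("AR operator after differencing (highest power first):", quotient)
```

Here the quotient corresponds to $1 - 0.5z$, so $\Delta y_t = 0.5\Delta y_{t-1} + u_t$ is a stable AR(1) process and $y_t$ is $I(1)$.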

More generally, $y_t$ may be defined to be an $I(1)$ process, if
$\Delta y_t = w_t$ is a stationary process with infinite MA representation,
$w_t = \sum_{j=0}^{\infty}\theta_j u_{t-j} = \theta(L)u_t$, where the MA
coefficients satisfy the condition $\sum_{j=0}^{\infty} j|\theta_j| < \infty$,
$\theta(1) = \sum_{j=0}^{\infty}\theta_j \neq 0$, and $u_t \sim (0, \sigma_u^2)$
is white noise. In that case, $y_t = y_{t-1} + w_t$ can be rewritten as

$$y_t = y_0 + w_1 + \cdots + w_t = y_0 + \theta(1)(u_1 + \cdots + u_t) + \sum_{j=0}^{\infty}\theta_j^* u_{t-j} - w_0^*, \tag{6.1.3}$$

where $\theta_j^* = -\sum_{i=j+1}^{\infty}\theta_i$, $j = 0, 1, \ldots$, and
$w_0^* = \sum_{j=0}^{\infty}\theta_j^* u_{-j}$ contains initial values. Thus,
$y_t$ can be represented as the sum of a random walk
$[\theta(1)(u_1 + \cdots + u_t)]$, a stationary process
$[\sum_{j=0}^{\infty}\theta_j^* u_{t-j}]$, and initial values $[y_0 - w_0^*]$.
Notice that the condition $\sum_{j=0}^{\infty} j|\theta_j| < \infty$ ensures that
$\sum_{j=0}^{\infty}|\theta_j^*| < \infty$, so that
$\sum_{j=0}^{\infty}\theta_j^* u_{t-j}$ is indeed well-defined by Proposition C.7
of Appendix C.3.

Although the condition for the $\theta_j$ is stronger than absolute summability,
it is satisfied for many processes of practical interest. The decomposition of
$y_t$ in (6.1.3) is known as the Beveridge-Nelson decomposition (see also
Appendix C.8). A similar decomposition for multivariate processes is helpful in
some of the subsequent analysis. It will be discussed in Section 6.3.
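
The decomposition (6.1.3) can be verified numerically in a simple special case. The Python sketch below assumes that $\Delta y_t = w_t = u_t + \theta_1 u_{t-1}$ is an MA(1) process, so that $\theta(1) = 1 + \theta_1$, $\theta_0^* = -\theta_1$, $\theta_j^* = 0$ for $j \geq 1$ and $w_0^* = -\theta_1 u_0$; the value of $\theta_1$, the starting value and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
T, theta1, y0 = 300, 0.6, 2.0
u = rng.standard_normal(T + 1)            # u[0] is u_0, u[t] is u_t

# Direct construction: y_t = y_0 + w_1 + ... + w_t with w_t = u_t + theta1 * u_{t-1}
w = u[1:] + theta1 * u[:-1]
y_direct = y0 + np.cumsum(w)

# Beveridge-Nelson components for the MA(1) case
theta_at_one = 1.0 + theta1               # theta(1)
theta_star0 = -theta1                     # theta*_0; theta*_j = 0 for j >= 1
random_walk = theta_at_one * np.cumsum(u[1:])
stationary = theta_star0 * u[1:]
w0_star = theta_star0 * u[0]              # initial-value term w*_0
y_bn = y0 + random_walk + stationary - w0_star

print(np.max(np.abs(y_direct - y_bn)))
```

Both constructions give the same series up to rounding error, with the random walk component carrying the stochastic trend.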
6.2  VAR  Processes  with  Integrated  Variables 

Consider now a $K$-dimensional VAR($p$) process without a deterministic term
as in (6.1.1). It can be written as

$$A(L)y_t = u_t, \tag{6.2.1}$$

where $A(L) := I_K - A_1 L - \cdots - A_p L^p$ and $L$ is the lag operator.
Multiplying from the left by the adjoint $A(L)^{\mathrm{adj}}$ of $A(L)$ gives

$$|A(L)|y_t = A(L)^{\mathrm{adj}}u_t \tag{6.2.2}$$

(see Appendix A.4.1 for the definition of the adjoint of a matrix). Thus, the
VAR($p$) process in (6.2.1) can be written as a process with univariate AR
operator, that is, all components have the same AR operator $|A(L)|$. The
right-hand side of (6.2.2), $A(L)^{\mathrm{adj}}u_t$, is a finite order MA process
(see Chapter 11 for further discussion of such processes). If $|A(L)|$ has $d$
unit roots and otherwise all roots are outside the unit circle, the AR operator
can be written as

$$|A(L)| = \alpha(L)(1 - L)^d = \alpha(L)\Delta^d,$$

where $\alpha(L)$ is an invertible operator. Consequently, $\Delta^d y_t$ is a
stable process. Hence, each component becomes stable upon differencing.

Because we are considering processes which are started at some specific
time $t_0$, we should perhaps think for a moment about the treatment of initial
values when multiplying by an operator such as $A(L)^{\mathrm{adj}}$ in (6.2.2).
One possible assumption is that the new representation is valid for all $t$ for
which the $y_t$'s are defined in (6.2.1).

The foregoing discussion shows that if a VAR($p$) process is unstable because
of unit roots only, it can be made stable by differencing its components.
Note, however, that, due to cancellations, it may not be necessary to difference
each component as many times as there are unit roots in $|A(L)|$. To illustrate
this point, consider the bivariate VAR(1) process

$$
\left(\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}L\right)
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \begin{bmatrix} (1 - L)y_{1t} \\ (1 - L)y_{2t} \end{bmatrix} = u_t.
$$

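Obviously, each component is stationary after differencing once, i.e., each
component is $I(1)$, although

$$
|A(L)| = \left|\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}L\right|
= \begin{vmatrix} 1 - L & 0 \\ 0 & 1 - L \end{vmatrix} = (1 - L)^2
$$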

has  two  unit  roots.  It  is  also  possible  that  some  components  are  stable  and 
stationary  as  univariate  processes  whereas  others  need  differencing.  Examples 
are  easy  to  construct. 
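
The cancellation in this example can be reproduced with polynomial arithmetic. The Python sketch below represents the entries of a $2 \times 2$ operator $A(L)$ as NumPy polynomials in $L$ and computes the determinant and the adjoint by the usual $2 \times 2$ formulas; the helper `det_and_adjoint_2x2` is a hypothetical name and the approach is limited to the bivariate case.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def det_and_adjoint_2x2(A):
    """Determinant and adjoint of a 2x2 matrix whose entries are polynomials in L."""
    (a, b), (c, d) = A
    det = a * d - b * c
    adj = [[d, -b], [-c, a]]
    return det, adj

one_minus_L = P([1.0, -1.0])     # coefficients ordered by increasing powers of L
zero = P([0.0])

# A(L) = I_2 - I_2 L, the bivariate VAR(1) operator of the example
A = [[one_minus_L, zero], [zero, one_minus_L]]
det, adj = det_and_adjoint_2x2(A)

unit_roots_in_det = int(np.sum(np.isclose(det.roots(), 1.0, atol=1e-6)))
print("|A(L)| coefficients:", det.coef)
print("unit roots of |A(L)|:", unit_roots_in_det)
print("adjoint diagonal entry coefficients:", adj[0][0].coef)
```

The determinant is $(1 - L)^2$, but every nonzero entry of the adjoint contains the factor $(1 - L)$ as well, so one factor cancels in (6.2.2) and differencing once suffices.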
If the VAR($p$) process has a nonzero intercept term so that

$$A(L)y_t = \nu + u_t$$

and $|A(z)|$ has one or more unit roots, then some of the components of $y_t$ may
have deterministic trends in their mean values. Unlike the univariate case, it is
also possible, however, that none of the components of $y_t$ has a deterministic
trend in mean. This occurs if $A(L)^{\mathrm{adj}}\nu = 0$. For instance, if

$$A(L) = \begin{bmatrix} 1 - L & \eta L \\ 0 & 1 \end{bmatrix},$$

$|A(z)|$ has a unit root and

$$A(L)^{\mathrm{adj}} = \begin{bmatrix} 1 & -\eta L \\ 0 & 1 - L \end{bmatrix}.$$

Hence,

$$
A(L)^{\mathrm{adj}}\nu
= \begin{bmatrix} 1 & -\eta L \\ 0 & 1 - L \end{bmatrix}
  \begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}
= \begin{bmatrix} \nu_1 - \eta\nu_2 \\ \nu_2 - \nu_2 \end{bmatrix}
$$

which is zero if $\nu_1 = \eta\nu_2$. Thus, in a VAR analysis an intercept term
cannot be excluded a priori if there are unit roots and none of the component
series has a deterministic trend.

The following question comes to mind in this context. Suppose each component
of a VAR($p$) process is $I(d)$, is it possible that differencing each component
individually distorts interesting features of the relationship between
the original variables? If the latter were not the case, a VAR analysis could be
performed as described in previous chapters after differencing the individual
components. It turns out, however, that differencing may indeed distort the
relationship between the original variables. Systems with cointegrated variables
are examples, where fitting VAR models upon differencing may be inadequate.
Such systems are introduced next.




6.3  Cointegrated  Processes,  Common  Stochastic  Trends, 
and  Vector  Error  Correction  Models 

Equilibrium relationships are suspected between many economic variables
such as household income and expenditures or prices of the same commodity
in different markets. Suppose the variables of interest are collected in
the vector $y_t = (y_{1t}, \ldots, y_{Kt})'$ and their long-run equilibrium
relation is $\beta'y_t = \beta_1 y_{1t} + \cdots + \beta_K y_{Kt} = 0$, where
$\beta = (\beta_1, \ldots, \beta_K)'$. In any particular period, this relation may
not be satisfied exactly but we may have $\beta'y_t = z_t$, where $z_t$ is a
stochastic variable representing the deviations from the equilibrium. If there
really is an equilibrium, it seems plausible to assume that the $y_t$ variables
move together and that $z_t$ is stable. This setup, however, does not exclude the
possibility that the $y_t$ variables wander extensively as a group. Thus, they may
be driven by a common stochastic trend. In other words, it is not excluded that
each variable is integrated, yet there exists a linear combination of the
variables which is stationary. Integrated variables with this property are called
cointegrated. In Figure 6.3, two artificially generated cointegrated time series
are depicted.




Fig.  6.3.  A  bivariate  cointegrated  time  series. 


Generally, the variables in a $K$-dimensional process $y_t$ are called
cointegrated of order $(d, b)$, briefly, $y_t \sim CI(d, b)$, if all components of
$y_t$ are $I(d)$ and there exists a linear combination $z_t := \beta'y_t$ with
$\beta = (\beta_1, \ldots, \beta_K)' \neq 0$ such that $z_t$ is $I(d - b)$. For
instance, if all components of $y_t$ are $I(1)$ and $\beta'y_t$ is stationary
($I(0)$), then $y_t \sim CI(1, 1)$. The vector $\beta$ is called a cointegrating
vector or a cointegration vector. A process consisting of cointegrated variables
is called a cointegrated process. These processes were introduced by Granger
(1981) and Engle & Granger (1987). Since then they have become popular in
theoretical and applied econometric work.

In the following, a slightly different definition of cointegration will be used
in order to simplify the terminology. We call a $K$-dimensional process $y_t$
integrated of order $d$, briefly, $y_t \sim I(d)$, if $\Delta^d y_t$ is stable and
$\Delta^{d-1}y_t$ is not stable. The $I(d)$ process $y_t$ is called cointegrated
if there is a linear combination $\beta'y_t$ with $\beta \neq 0$ which is
integrated of order less than $d$. This definition differs from the one given by
Engle & Granger (1987) in that we do not exclude components of $y_t$ with order of
integration less than $d$. If there is just one $I(d)$ component in $y_t$ and all
other components are stable ($I(0)$), then the vector $y_t$ is $I(d)$ according to
our definition because $\Delta^d y_t$ is stable and $\Delta^{d-1}y_t$ is not. In
such a case a relation $\beta'y_t$ that involves the stationary components only is
a cointegration relation in our terms. Clearly, this aspect of our definition is
not in line with the original idea of cointegration as a special relation between
integrated variables with common stochastic trends. In the following, our
definition is still useful because it simplifies the terminology as it avoids
distinguishing between variables with different orders of integration. The reader
should keep in mind the basic ideas of cointegration when it comes to interpreting
specific relationships, however.

Obviously,  a  cointegrating  vector  is  not  unique.  Multiplying  by  a  nonzero 
constant  yields  a  further  cointegrating  vector.  Also,  there  may  be  various 
linearly  independent  cointegrating  vectors.  For  instance,  if  there  are  four  vari¬ 
ables  in  a  system,  the  first  two  may  be  connected  by  a  long-run  equilibrium 
relation  and  also  the  last  two.  Thus,  there  may  be  a  cointegrating  vector  with 
zeros  in  the  last  two  positions  and  one  with  zeros  in  the  first  two  positions. 
In  addition,  there  may  be  a  cointegration  relation  involving  all  four  variables. 

Before the concept of cointegration was introduced, the closely related
error correction models were discussed in the econometrics literature (see,
e.g., Davidson, Hendry, Srba & Yeo (1978), Hendry & von Ungern-Sternberg
(1981), Salmon (1982)). In an error correction model, the changes in a variable
depend on the deviations from some equilibrium relation. Suppose, for
instance, that $y_{1t}$ represents the price of a commodity in a particular market
and $y_{2t}$ is the corresponding price of the same commodity in another market.
Assume furthermore that the equilibrium relation between the two variables
is given by $y_{1t} = \beta_1 y_{2t}$ and that the changes in $y_{1t}$ depend on
the deviations from this equilibrium in period $t - 1$,

$$\Delta y_{1t} = \alpha_1(y_{1,t-1} - \beta_1 y_{2,t-1}) + u_{1t}.$$

A similar relation may hold for $y_{2t}$,

$$\Delta y_{2t} = \alpha_2(y_{1,t-1} - \beta_1 y_{2,t-1}) + u_{2t}.$$

In a more general error correction model, the $\Delta y_{it}$ may in addition
depend on previous changes in both variables as, for instance, in the following
model:

$$
\begin{aligned}
\Delta y_{1t} &= \alpha_1(y_{1,t-1} - \beta_1 y_{2,t-1}) + \gamma_{11,1}\Delta y_{1,t-1} + \gamma_{12,1}\Delta y_{2,t-1} + u_{1t}, \\
\Delta y_{2t} &= \alpha_2(y_{1,t-1} - \beta_1 y_{2,t-1}) + \gamma_{21,1}\Delta y_{1,t-1} + \gamma_{22,1}\Delta y_{2,t-1} + u_{2t}.
\end{aligned}
\tag{6.3.1}
$$

Further lags of the $\Delta y_{it}$'s may also be included.
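
A small simulation may help to see how the error correction mechanism works. The Python sketch below simulates many replications of the simple model without lagged differences, using the hypothetical values $\alpha_1 = -0.2$, $\alpha_2 = 0.1$, $\beta_1 = 1$ and standard normal errors. Under these assumptions the deviation $z_t = y_{1t} - \beta_1 y_{2t}$ follows an AR(1) process with coefficient $1 + \alpha_1 - \beta_1\alpha_2 = 0.7$, so its variance settles down while the variance of $y_{1t}$ keeps growing.

```python
import numpy as np

rng = np.random.default_rng(7)
n_paths, T = 5000, 400
alpha1, alpha2, beta1 = -0.2, 0.1, 1.0   # illustrative values, not from the text

y1 = np.zeros(n_paths)
y2 = np.zeros(n_paths)
var_y1, var_z = [], []
for t in range(T):
    z_prev = y1 - beta1 * y2             # deviation from equilibrium in the previous period
    u1 = rng.standard_normal(n_paths)
    u2 = rng.standard_normal(n_paths)
    y1 = y1 + alpha1 * z_prev + u1       # Delta y_1t = alpha_1 * z_{t-1} + u_1t
    y2 = y2 + alpha2 * z_prev + u2       # Delta y_2t = alpha_2 * z_{t-1} + u_2t
    var_y1.append(y1.var())
    var_z.append((y1 - beta1 * y2).var())

print("Var(y_1t) at t = 200 and t = 400:", round(var_y1[199], 1), round(var_y1[-1], 1))
print("Var(z_t)  at t = 200 and t = 400:", round(var_z[199], 2), round(var_z[-1], 2))
```

The cross-sectional variance of $y_{1t}$ roughly doubles between $t = 200$ and $t = 400$, whereas the variance of the deviation stays near $2/(1 - 0.7^2)$.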
To see the close relationship between error correction models and the concept
of cointegration, suppose that $y_{1t}$ and $y_{2t}$ are both $I(1)$ variables. In
that case all terms in (6.3.1) involving the $\Delta y_{it}$ are stable. In
addition, $u_{1t}$ and $u_{2t}$ are white noise errors which are also stable.
Because an unstable term cannot equal a stable process,

$$\alpha_i(y_{1,t-1} - \beta_1 y_{2,t-1}) = \Delta y_{it} - \gamma_{i1,1}\Delta y_{1,t-1} - \gamma_{i2,1}\Delta y_{2,t-1} - u_{it}$$

must be stable too. Hence, if $\alpha_1 \neq 0$ or $\alpha_2 \neq 0$,
$y_{1t} - \beta_1 y_{2t}$ is stable and, thus, represents a cointegration relation.

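In vector and matrix notation the model (6.3.1) can be written as

$$\Delta y_t = \alpha\beta'y_{t-1} + \Gamma_1\Delta y_{t-1} + u_t,$$

or

$$y_t - y_{t-1} = \alpha\beta'y_{t-1} + \Gamma_1(y_{t-1} - y_{t-2}) + u_t, \tag{6.3.2}$$

where $y_t := (y_{1t}, y_{2t})'$, $u_t := (u_{1t}, u_{2t})'$,

$$
\alpha := \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}, \quad
\beta' := (1, -\beta_1), \quad \text{and} \quad
\Gamma_1 := \begin{bmatrix} \gamma_{11,1} & \gamma_{12,1} \\ \gamma_{21,1} & \gamma_{22,1} \end{bmatrix}.
$$

Rearranging terms in (6.3.2) gives the VAR(2) representation

$$y_t = (I_K + \Gamma_1 + \alpha\beta')y_{t-1} - \Gamma_1 y_{t-2} + u_t.$$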


Hence,  cointegrated  variables  may  be  generated  by  a  VAR  process. 

To see how cointegration can arise more generally in $K$-dimensional VAR
models, consider the VAR(2) process

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + u_t \tag{6.3.3}$$

with $y_t = (y_{1t}, \ldots, y_{Kt})'$. Suppose the process is unstable with

$$|I_K - A_1 z - A_2 z^2| = (1 - \lambda_1 z)\cdots(1 - \lambda_n z) = 0 \quad \text{for } z = 1.$$

Because the $\lambda_i$ are the reciprocals of the roots of the determinantal
polynomial, one or more of them must be equal to 1. All other roots are assumed to
lie outside the unit circle, that is, all $\lambda_i$ which are not 1 are inside
the complex unit circle. Because $|I_K - A_1 - A_2| = 0$, the matrix

$$\Pi := -(I_K - A_1 - A_2)$$

is singular. Suppose $\operatorname{rk}(\Pi) = r < K$. Then $\Pi$ can be decomposed
as $\Pi = \alpha\beta'$, where $\alpha$ and $\beta$ are $(K \times r)$ matrices.
From the discussion in the previous section, we know that each variable becomes
stationary upon differencing. Let us assume that differencing once is sufficient,
subtract $y_{t-1}$ on both sides of (6.3.3) and rearrange terms as
$$y_t - y_{t-1} = -(I_K - A_1 - A_2)y_{t-1} - A_2 y_{t-1} + A_2 y_{t-2} + u_t$$

or

$$\Delta y_t = \Pi y_{t-1} + \Gamma_1\Delta y_{t-1} + u_t, \tag{6.3.4}$$

where $\Gamma_1 := -A_2$, or

$$\alpha\beta'y_{t-1} = \Delta y_t - \Gamma_1\Delta y_{t-1} - u_t.$$

Because the right-hand side involves stationary terms only, $\alpha\beta'y_{t-1}$
must also be stationary and it remains stationary upon multiplication by
$(\alpha'\alpha)^{-1}\alpha'$. In other words, $\beta'y_t$ is stationary and,
hence, each element of $\beta'y_t$ represents a cointegrating relation. Note that
simply taking first differences of all variables in (6.3.3) eliminates the
cointegration term which may well contain relations of great importance for a
particular analysis. Moreover, in general, a VAR process with cointegrated
variables does not admit a pure VAR representation in first differences.

It may also be worth emphasizing that here we have worked under the
assumption that all variables are stationary after differencing once. In general,
variables with higher integration orders may also be present. In that case,
$\beta'y_t$ may not be stationary even if $\operatorname{rk}(\Pi) = r < K$. The
components of $y_t$ may still be cointegrated of a higher order if linear
combinations exist which have a reduced order of integration.

In the following, we will be interested in the specific case where all individual variables are $I(1)$ or $I(0)$. The $K$-dimensional VAR($p$) process

\[ y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t \tag{6.3.5} \]

is called cointegrated of rank $r$ if

\[ \Pi := -(I_K - A_1 - \cdots - A_p) \]

has rank $r$ and, thus, $\Pi$ can be written as a matrix product $\alpha\beta'$ with $\alpha$ and $\beta$ being of dimension $(K \times r)$ and of rank $r$. The matrix $\beta$ is called a cointegrating or cointegration matrix or a matrix of cointegrating or cointegration vectors and $\alpha$ is sometimes called the loading matrix. If $r = 0$, $\Delta y_t$ has a stable VAR($p-1$) representation and, for $r = K$, $|I_K - A_1 - \cdots - A_p| = |-\Pi| \neq 0$ and, hence, the VAR operator has no unit roots so that $y_t$ is a stable VAR($p$) process.

Rewriting (6.3.5) as in (6.3.4) it has a vector error correction model (VECM) representation

\[ \Delta y_t = \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \tag{6.3.6} \]

where

\[ \Gamma_i := -(A_{i+1} + \cdots + A_p), \quad i = 1, \ldots, p-1. \]

If this representation of a cointegrated process is given, it is easy to recover the corresponding VAR form (6.3.5) by noting that

\[ A_1 = \Pi + I_K + \Gamma_1, \]
\[ A_i = \Gamma_i - \Gamma_{i-1}, \quad i = 2, \ldots, p-1, \tag{6.3.7} \]
\[ A_p = -\Gamma_{p-1}. \]
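The mapping between (6.3.5) and (6.3.6) and its inverse (6.3.7) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `var_to_vecm` and `vecm_to_var` and the randomly drawn coefficient matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

def var_to_vecm(A):
    """Map VAR coefficients [A_1, ..., A_p] to (Pi, [Gamma_1, ..., Gamma_{p-1}])."""
    K = A[0].shape[0]
    Pi = -(np.eye(K) - sum(A))
    # Gamma_i = -(A_{i+1} + ... + A_p), i = 1, ..., p-1
    Gammas = [-sum(A[i:]) for i in range(1, len(A))]
    return Pi, Gammas

def vecm_to_var(Pi, Gammas):
    """Recover [A_1, ..., A_p] from the VECM parameters as in (6.3.7)."""
    K = Pi.shape[0]
    if not Gammas:                      # p = 1: A_1 = Pi + I_K
        return [Pi + np.eye(K)]
    A = [Pi + np.eye(K) + Gammas[0]]    # A_1 = Pi + I_K + Gamma_1
    A += [Gammas[i] - Gammas[i - 1] for i in range(1, len(Gammas))]
    A.append(-Gammas[-1])               # A_p = -Gamma_{p-1}
    return A

rng = np.random.default_rng(0)
A_in = [rng.normal(size=(3, 3)) for _ in range(3)]   # arbitrary VAR(3) coefficients
Pi, Gammas = var_to_vecm(A_in)
A_out = vecm_to_var(Pi, Gammas)
print(all(np.allclose(a, b) for a, b in zip(A_in, A_out)))
```

The round trip holds for any coefficient matrices, cointegrated or not, because (6.3.6) is just a reparameterization of (6.3.5).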

It  may  be  worth  pointing  out  that  we  can  also  rearrange  the  terms  in  a 
different  way  and  obtain  a  representation 


\[ \Delta y_t = D_1 \Delta y_{t-1} + \cdots + D_{p-1}\Delta y_{t-p+1} + \Pi y_{t-p} + u_t, \tag{6.3.8} \]

where the error correction term appears at lag $p$ and

\[ D_i = -(I_K - A_1 - \cdots - A_i), \quad i = 1, \ldots, p-1. \]

In the following sections, we will usually work with (6.3.5) or (6.3.6). Of course, thereby we work within a much more narrow framework than that allowed for in the general definition of cointegration. First, we consider $I(1)$ processes
only  and,  second,  the  discussion  is  limited  to  finite  order  VAR  processes  or 
VECMs. 

It is important to note that the decomposition of the $(K \times K)$ matrix $\Pi$ as the product of two $(K \times r)$ matrices, $\Pi = \alpha\beta'$, is not unique. In fact, for every nonsingular $(r \times r)$ matrix $Q$, we can define $\alpha^* = \alpha Q'$ and $\beta^* = \beta Q^{-1}$ and get $\Pi = \alpha^*\beta^{*\prime}$. This nonuniqueness of the decomposition of $\Pi$ shows again that the cointegration relations are not unique. It is possible, however, to impose restrictions on $\beta$ and/or $\alpha$ to get unique relations. Such restrictions
may  be  implied  by  subject  matter  considerations  or  they  may  be  imposed  for 
convenience,  using  the  algebraic  properties  of  the  associated  matrices. 

As an example, consider a system of three interest rates, $y_t = (y_{1t}, y_{2t}, y_{3t})'$, where $y_{1t}$ is a short-term rate, $y_{2t}$ is a medium-term rate, and $y_{3t}$ is a long-term rate. Suppose all three interest rates are $I(1)$ variables whereas the interest rate spreads, $y_{it} - y_{jt}$ ($i \neq j$), are stationary ($I(0)$). Then we have two linearly independent cointegrating relations which can, for example, be written as

\[ \beta' y_t = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix} y_t \]

or, alternatively, as

\[ \beta^{*\prime} y_t = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix} y_t. \]


Using the fact that $\mathrm{rk}(\beta) = r$, there must be $r$ linearly independent rows. Thus, by a suitable rearrangement of the variables it can always be ensured that the first $r$ rows of $\beta$ are linearly independent. Hence, the upper $(r \times r)$ submatrix consisting of the first $r$ rows of $\beta$ is nonsingular. Choosing $Q$ then equal to this matrix gives a cointegration matrix

\[ \beta^* = \begin{bmatrix} I_r \\ \beta_{(K-r)} \end{bmatrix}, \tag{6.3.9} \]

where $\beta_{(K-r)}$ is $((K-r) \times r)$. This normalization will occasionally be used in
the  following  because  it  is  quite  convenient  to  ensure  a  unique  cointegration 
matrix.  It  does  not  imply  a  loss  of  generality  except  that  it  is  assumed  that  the 
variables  are  arranged  in  the  right  way  so  that  the  normalization  is  feasible. 
If  the  system  is  known,  as  implicitly  assumed  here,  rearranging  the  variables 
in  a  suitable  way  is  no  problem,  of  course.  In  fact,  we  just  need  to  know  the 
cointegration  properties  between  all  subsets  of  variables  in  order  to  find  a 
suitable  arrangement  of  the  variables. 
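The normalization can be illustrated with the interest rate example. The sketch below assumes NumPy; the loading matrix `alpha` contains made-up numbers, and `beta` holds the two spreads $y_{1t} - y_{2t}$ and $y_{2t} - y_{3t}$ as columns.

```python
import numpy as np

# Hypothetical loadings; any (3 x 2) matrix of rank 2 would do for this illustration.
alpha = np.array([[-0.5, 0.1],
                  [0.2, -0.4],
                  [0.0, 0.3]])
# Columns are the cointegration vectors for the spreads y1 - y2 and y2 - y3.
beta = np.array([[1.0, 0.0],
                 [-1.0, 1.0],
                 [0.0, -1.0]])
Pi = alpha @ beta.T

r = beta.shape[1]
Q = beta[:r, :]                       # upper (r x r) block, assumed nonsingular
beta_star = beta @ np.linalg.inv(Q)   # beta* = beta Q^{-1}, upper block becomes I_r
alpha_star = alpha @ Q.T              # alpha* = alpha Q'

print(beta_star)
print(np.allclose(alpha_star @ beta_star.T, Pi))
```

Here the normalized matrix reproduces the first of the two forms given for the interest rate example, with both spreads expressed relative to the long-term rate, while $\Pi$ itself is unchanged.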

To see this, consider again a three-dimensional system, $y_t = (y_{1t}, y_{2t}, y_{3t})'$, with cointegrating rank 1 so that there is just one cointegration vector $\beta$. In that case, the normalization in (6.3.9) amounts to setting the first component of the cointegration vector to one. Hence, $\beta^{*\prime} y_t = [1, \beta'_{(K-1)}]y_t = y_{1t} + \beta_2 y_{2t} + \beta_3 y_{3t}$. Clearly, this normalization is only feasible if the first component of $y_t$ actually belongs to the cointegration relation and has a nonzero coefficient. If we know that $y_{2t}$ and $y_{3t}$ are not cointegrated while $y_{1t}$, $y_{2t}$, and $y_{3t}$ together are cointegrated, then we know already that $y_{1t}$ is part of the cointegration relation and, thus, has a nonzero coefficient in $\beta$.

As another example, suppose $y_t$ has cointegrating rank 2. In that case the normalized cointegrating relations are

\[ \begin{bmatrix} 1 & 0 & \beta_1 \\ 0 & 1 & \beta_2 \end{bmatrix} y_t = \begin{bmatrix} y_{1t} + \beta_1 y_{3t} \\ y_{2t} + \beta_2 y_{3t} \end{bmatrix}. \]

Thus, a cointegration relation must exist in the bivariate systems $(y_{1t}, y_{3t})'$ and $(y_{2t}, y_{3t})'$. By checking these subsystems separately, a possible ordering of the variables is easy to find. It may be worth mentioning, however, that given our general definition of cointegration, it is possible that in this example $y_{1t}$ or/and $y_{2t}$ are in fact stationary $I(0)$ variables. For instance, if both are $I(0)$, $\beta_1 = \beta_2 = 0$. Recall that a process $y_t$ is called $I(1)$ even if only a single component is $I(1)$ and the other components are $I(0)$.

Generally, any stationary variables in the system must be placed in the upper $r$-dimensional subvector of $y_t$. If $y_{kt}$, the $k$-th component of $y_t$, is stationary, there is a 'cointegrating relation' $\beta_k' y_t$ with $\beta_k$ being a vector with a one as the $k$-th component and zeros elsewhere so that $\beta_k' y_t = y_{kt}$. Thus, there is a cointegrating relation for each of the stationary components of $y_t$. Because the associated cointegrating vectors are linearly independent, the cointegrating rank must be at least as great as the number of $I(0)$ variables in the system.

The important result to remember from this discussion is that the normalization of the cointegration matrix given in (6.3.9) is always possible if the variables are arranged in a suitable way. Finding the proper ordering is easy if the cointegration properties of all subsystems are known, including the univariate subsystems. In other words, we also need to know the order of integration of the individual variables in the system. In practice, the order of integration and the cointegrating rank of a given system and its subsystems will
not  be  known.  Statistical  procedures  for  determining  the  cointegrating  rank 
which  can  help  to  overcome  this  practical  problem  are  discussed  in  Chapter 


If the normalization in (6.3.9) is made, the system may also be set up as

\[ y_t^{(1)} = -\beta'_{(K-r)} y_t^{(2)} + z_t^{(1)}, \]
\[ \Delta y_t^{(2)} = z_t^{(2)}, \tag{6.3.10} \]

where $y_t^{(1)}$ and $z_t^{(1)}$ are $(r \times 1)$, $y_t^{(2)}$ and $z_t^{(2)}$ are $((K-r) \times 1)$ and $z_t = (z_t^{(1)\prime}, z_t^{(2)\prime})'$ is a stationary process. There cannot be any cointegrating relations between the components of the subsystem $y_t^{(2)}$, because otherwise there would be more than $r$ linearly independent cointegrating relations and the cointegrating rank would be larger than $r$. Thus, the variables in $y_t^{(2)}$ represent stochastic trends in the system. The representation (6.3.10) is known as the triangular representation of a cointegrated system. It has been used extensively in some of the literature related to cointegration analysis (see, e.g., Park & Phillips (1988, 1989)).

Yet another useful representation of a cointegrated system is given by Johansen (1995, Theorem 4.2). The underlying result is often referred to as the Granger representation theorem. To state this representation, we use the following notation. For $m \geq n$, we denote by $M_\perp$ an orthogonal complement of the $(m \times n)$ matrix $M$ with $\mathrm{rk}(M) = n$ (see also Appendix A.8.2). In other words, $M_\perp$ is any $(m \times (m-n))$ matrix with $\mathrm{rk}(M_\perp) = m - n$ and $M'M_\perp = 0$. If $M$ is a nonsingular square matrix ($m = n$), then $M_\perp = 0$ and if $n = 0$, we define $M_\perp = I_m$. This latter convention is sometimes useful to avoid clumsy notation and looking at different cases separately. We assume that $y_t$ is a $K$-dimensional cointegrated $I(1)$ process as in (6.3.6) with cointegration rank $r$, $0 \leq r < K$. Then the following proposition holds.


Proposition 6.1 (Granger Representation Theorem)
Suppose

\[ \Delta y_t = \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \quad t = 1, 2, \ldots, \]

where $y_t = 0$ for $t \leq 0$, $u_t$ is white noise for $t = 1, 2, \ldots$, and $u_t = 0$ for $t \leq 0$. Moreover, define

\[ C(z) := (1 - z)I_K - \alpha\beta' z - \sum_{i=1}^{p-1} \Gamma_i (1 - z) z^i \]

and let the following conditions hold for the parameters:

(a) $\det C(z) = 0 \Rightarrow |z| > 1$ or $z = 1$.

(b) The number of unit roots, $z = 1$, is exactly $K - r$.

(c) $\alpha$ and $\beta$ are $(K \times r)$ matrices with $\mathrm{rk}(\alpha) = \mathrm{rk}(\beta) = r$.

Then $y_t$ has the representation

\[ y_t = \Xi \sum_{i=1}^{t} u_i + \Xi^*(L)u_t + y_0^*, \tag{6.3.11} \]

where

\[ \Xi = \beta_\perp \left[ \alpha_\perp' \left( I_K - \sum_{i=1}^{p-1} \Gamma_i \right) \beta_\perp \right]^{-1} \alpha_\perp', \tag{6.3.12} \]

$\Xi^*(L)u_t = \sum_{j=0}^{\infty} \Xi_j^* u_{t-j}$ is an $I(0)$ process and $y_0^*$ contains initial values. ■

Remark 1 The proposition is of fundamental importance because it decomposes the process $y_t$ into $I(1)$ and $I(0)$ components which have to be treated accordingly, for example, when asymptotic properties of parameter estimators are derived (see Chapter 7). It makes precise under what conditions the process $y_t$ is driven by $K - r$ $I(1)$ components and $r$ $I(0)$ components. The representation in (6.3.11) is a multivariate version of the Beveridge-Nelson decomposition of $y_t$. The first term on the right-hand side of (6.3.11) consists of $K$ random walks $\sum_{i=1}^{t} u_i$ which are multiplied by a matrix of rank $K - r$, denoted by $\Xi$. Thus, there are actually $K - r$ stochastic trends driving the system. They determine to a large extent the development of $y_t$. Therefore one may call $y_t$ an $I(1)$ process if there are actually $I(1)$ trends (random walks) in the representation (6.3.11). In other words, $y_t$ is $I(1)$ if it has the representation (6.3.11) with $\Xi \neq 0$. Clearly, for $\Xi$ to have the form given in (6.3.12), the $((K-r) \times (K-r))$ matrix

\[ \alpha_\perp' \left( I_K - \sum_{i=1}^{p-1} \Gamma_i \right) \beta_\perp \]

must be invertible. Only under that condition, $\mathrm{rk}(\Xi) = K - r$. Therefore the latter condition ensures that $y_t$ is actually driven by $K - r$ random walk components. ■
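The matrix $\Xi$ in (6.3.12) is straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy; the helper names `orth_complement` and `long_run_matrix` are hypothetical, the orthogonal complement is taken from a singular value decomposition (one of many valid choices), and the bivariate parameter values are chosen only for illustration.

```python
import numpy as np

def orth_complement(M):
    """One orthogonal complement of an (m x n) matrix M with full column rank."""
    n = M.shape[1]
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    return U[:, n:]                   # last m - n left singular vectors: M' M_perp = 0

def long_run_matrix(alpha, beta, Gammas=()):
    """Xi = beta_perp [alpha_perp' (I_K - sum_i Gamma_i) beta_perp]^{-1} alpha_perp'."""
    K = alpha.shape[0]
    a_perp, b_perp = orth_complement(alpha), orth_complement(beta)
    Gamma1 = np.eye(K) - sum(Gammas)  # I_K - Gamma_1 - ... - Gamma_{p-1}
    inner = a_perp.T @ Gamma1 @ b_perp
    return b_perp @ np.linalg.inv(inner) @ a_perp.T

# Illustrative bivariate VECM without lagged differences (p = 1), K = 2, r = 1.
alpha = np.array([[-1.0], [0.0]])
beta = np.array([[1.0], [-1.0]])
Xi = long_run_matrix(alpha, beta)
print(Xi)
print(np.linalg.matrix_rank(Xi))      # rank K - r
```

The result does not depend on which orthogonal complements are used, because any rescaling or rotation of $\alpha_\perp$ and $\beta_\perp$ cancels in the product.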

Remark 2 The parameter matrices $\Xi_j^*$ in (6.3.11) are determined by the model parameters. To state the precise relation, we define

\[ \bar\beta := \beta(\beta'\beta)^{-1} \ (K \times r), \qquad Q := \begin{bmatrix} \beta' \\ \beta_\perp' \end{bmatrix} \ (K \times K) \quad \text{so that} \quad Q^{-1} = [\bar\beta : \bar\beta_\perp], \]

\[ \Gamma(z) := I_K - \sum_{i=1}^{p-1} \Gamma_i z^i, \]

\[ B_*(z) := Q[\Gamma(z)\bar\beta(1 - z) - \alpha z : \Gamma(z)\bar\beta_\perp], \]

\[ B(z) = I_K - \sum_{i=1}^{p} B_i z^i := Q^{-1}B_*(z)Q, \tag{6.3.13} \]

and

\[ \Theta(z) = \sum_{j=0}^{\infty} \Theta_j z^j := B(z)^{-1}. \]

Notice that $B(0) = Q^{-1}B_*(0)Q = [\bar\beta : \bar\beta_\perp]Q = I_K$. Hence, $B(z)$ has the representation $I_K - \sum_{i=1}^{p} B_i z^i$ stated in (6.3.13). Moreover, the matrix operator $\Theta(z)$ can be decomposed as

\[ \Theta(z) = \Theta(1) + (1 - z)\Theta^*(z), \]

where expressions for the $\Theta_j^*$'s can be found by comparing coefficients in $\Theta(z) = \sum_{j=0}^{\infty} \Theta_j z^j$ and

\[ \Theta(1) + (1 - z)\Theta^*(z) = \Theta(1) + \sum_{j=0}^{\infty} \Theta_j^* z^j (1 - z) = (\Theta(1) + \Theta_0^*) + \sum_{j=1}^{\infty} (\Theta_j^* - \Theta_{j-1}^*) z^j. \]

Hence,

\[ \Theta_0 = \Theta(1) + \Theta_0^* \]

and

\[ \Theta_i = \Theta_i^* - \Theta_{i-1}^*, \quad i = 1, 2, \ldots. \]

Using the last expression, we get by successive substitution,

\[ \Theta_i^* = \Theta_i + \Theta_{i-1}^* = \sum_{j=1}^{i} \Theta_j + \Theta_0^* = \sum_{j=1}^{i} \Theta_j + \Theta_0 - \Theta(1) = -\sum_{j=i+1}^{\infty} \Theta_j, \quad i = 1, 2, \ldots. \tag{6.3.14} \]

From these quantities the operator $\Xi^*(z)$ in (6.3.11) can be obtained as

\[ \Xi^*(z) = [\Theta^*(z) + \bar\beta\beta' B(z)^{-1}] \tag{6.3.15} \]

(see the proof of Proposition 6.1). The representation (6.3.11) will turn out to be useful, for example, in Chapter 9, where structural VECMs are discussed. The coefficient matrices $\Xi_j^*$ of the operator $\Xi^*(z)$ will then play an important role as specific impulse response coefficients. ■

Proof of Proposition 6.1

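The proof is adapted from Saikkonen (2005). We use the notation from Remark 2 and first show that under the conditions of Proposition 6.1,

\[ C(z) = Q^{-1}B_*(z)P(z), \tag{6.3.16} \]

where

\[ P(z) := \begin{bmatrix} \beta' \\ (1 - z)\beta_\perp' \end{bmatrix} = \begin{bmatrix} I_r & 0 \\ 0 & (1 - z)I_{K-r} \end{bmatrix} \begin{bmatrix} \beta' \\ \beta_\perp' \end{bmatrix}. \]

This representation is obtained by noting that

\[ C(z) = [\Gamma(z)(1 - z) - \alpha\beta' z]Q^{-1}Q \]
\[ \phantom{C(z)} = [\Gamma(z)\bar\beta(1 - z) - \alpha\beta'\bar\beta z : \Gamma(z)\bar\beta_\perp(1 - z) - \alpha\beta'\bar\beta_\perp z]Q \]
\[ \phantom{C(z)} = [\Gamma(z)\bar\beta(1 - z) - \alpha z : \Gamma(z)\bar\beta_\perp(1 - z)]Q \]
\[ \phantom{C(z)} = Q^{-1}Q[\Gamma(z)\bar\beta(1 - z) - \alpha z : \Gamma(z)\bar\beta_\perp] \begin{bmatrix} \beta' \\ (1 - z)\beta_\perp' \end{bmatrix}. \]

Clearly, $\det P(z)$ has exactly $K - r$ unit roots and, thus, $\det B_*(z)$ cannot have any such roots so that $\det B_*(z) \neq 0$ for $|z| \leq 1$ must hold. In other words, $B_*(L)$ is an invertible operator.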

Now define

\[ z_t := Q^{-1}P(L)y_t = \bar\beta\beta' y_t + \bar\beta_\perp\beta_\perp' \Delta y_t \tag{6.3.17} \]

and note that

\[ \beta' z_t = \beta' y_t. \tag{6.3.18} \]

For the operator $B(z) = Q^{-1}B_*(z)Q$, we have $B(0) = Q^{-1}B_*(0)Q = I_K$ and $\det B(z) \neq 0$ for $|z| \leq 1$ because $\det B_*(z)$ has no roots inside or on the complex unit circle. Moreover,

\[ B(L)z_t = Q^{-1}B_*(L)QQ^{-1}P(L)y_t = C(L)y_t = u_t. \]

Thus,

\[ z_t = \sum_{j=1}^{p} B_j z_{t-j} + u_t \tag{6.3.19} \]

is a stable VAR($p$) process with the same residual process $u_t$ as $y_t$. We know from Chapter 2 that it has an MA representation

\[ z_t = B(L)^{-1}u_t = \Theta(L)u_t = \sum_{j=0}^{\infty} \Theta_j u_{t-j}. \tag{6.3.20} \]

As we have seen in Remark 2, the matrix operator $\Theta(z)$ can be decomposed
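as

\[ \Theta(z) = \Theta(1) + (1 - z)\Theta^*(z). \]

Hence, we get from (6.3.20),

\[ z_t = \Theta(1)u_t + \Theta^*(L)\Delta u_t = B(1)^{-1}u_t + \Theta^*(L)\Delta u_t. \tag{6.3.21} \]

Using

\[ y_t = Q^{-1}Qy_t = [\bar\beta : \bar\beta_\perp] \begin{bmatrix} \beta' \\ \beta_\perp' \end{bmatrix} y_t = \bar\beta\beta' y_t + \bar\beta_\perp\beta_\perp' y_t \]

and, hence,

\[ \Delta y_t = \bar\beta\beta' \Delta y_t + \bar\beta_\perp\beta_\perp' \Delta y_t, \]

it follows from (6.3.17) and (6.3.18) that $\bar\beta_\perp\beta_\perp' \Delta y_t = z_t - \bar\beta\beta' z_t$. Thus,

\[ \bar\beta_\perp\beta_\perp' \Delta y_t = \bar\beta_\perp\beta_\perp' z_t. \]

Substituting the expression from (6.3.21) for $z_t$ gives

\[ \Delta y_t = \bar\beta_\perp\beta_\perp' z_t + \bar\beta\beta' \Delta y_t \]
\[ \phantom{\Delta y_t} = \bar\beta_\perp\beta_\perp' B(1)^{-1}u_t + \Theta^*(L)\Delta u_t + \bar\beta\beta' \Delta y_t =: w_t. \]

Solving for $y_t = y_{t-1} + w_t$ results in

\[ y_t = y_0 + \sum_{i=1}^{t} w_i \]
\[ \phantom{y_t} = y_0 + \bar\beta_\perp\beta_\perp' B(1)^{-1} \sum_{i=1}^{t} u_i + \Theta^*(L) \sum_{i=1}^{t} \Delta u_i + \bar\beta\beta' \sum_{i=1}^{t} \Delta y_i \]
\[ \phantom{y_t} = y_0 + \bar\beta_\perp\beta_\perp' B(1)^{-1} \sum_{i=1}^{t} u_i + \Theta^*(L)(u_t - u_0) + \bar\beta\beta'(y_t - y_0) \]
\[ \phantom{y_t} = \bar\beta_\perp\beta_\perp' B(1)^{-1} \sum_{i=1}^{t} u_i + \Theta^*(L)u_t + \bar\beta\beta' y_t + y_0^*, \tag{6.3.22} \]

where $y_0^* := y_0 - \Theta^*(L)u_0 - \bar\beta\beta' y_0$. Using $\beta' y_t = \beta' z_t$, the term $\bar\beta\beta' y_t = \bar\beta\beta' z_t$ is seen to have a representation

\[ \bar\beta\beta' z_t = \bar\beta\beta' \Theta(L)u_t \]

and, thus, $\Theta^*(L)u_t + \bar\beta\beta' y_t$ has an MA representation

\[ \Xi^*(L)u_t = [\Theta^*(L) + \bar\beta\beta' \Theta(L)]u_t. \]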



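For the first term on the right-hand side of (6.3.22) we have

\[ \bar\beta_\perp\beta_\perp' B(1)^{-1} = \bar\beta_\perp\beta_\perp' Q^{-1}B_*(1)^{-1}Q \]
\[ \phantom{\bar\beta_\perp\beta_\perp' B(1)^{-1}} = \bar\beta_\perp\beta_\perp' [\bar\beta : \bar\beta_\perp][-\alpha : \Gamma(1)\bar\beta_\perp]^{-1} \]
\[ \phantom{\bar\beta_\perp\beta_\perp' B(1)^{-1}} = \bar\beta_\perp [0 : I_{K-r}][-\alpha : \Gamma(1)\bar\beta_\perp]^{-1} \]
\[ \phantom{\bar\beta_\perp\beta_\perp' B(1)^{-1}} = \bar\beta_\perp [\alpha_\perp' \Gamma(1)\bar\beta_\perp]^{-1} \alpha_\perp', \]

because

\[ [-\alpha : \Gamma(1)\bar\beta_\perp]^{-1} = \begin{bmatrix} (\alpha'\alpha)^{-1}\alpha' \{ \Gamma(1)\bar\beta_\perp [\alpha_\perp' \Gamma(1)\bar\beta_\perp]^{-1} \alpha_\perp' - I_K \} \\ [\alpha_\perp' \Gamma(1)\bar\beta_\perp]^{-1} \alpha_\perp' \end{bmatrix}. \]

Hence, $\Xi = \bar\beta_\perp\beta_\perp' B(1)^{-1}$ is as stated in the proposition, because the last expression does not depend on which orthogonal complement of $\beta$ is used. Notice that the invertibility of $\alpha_\perp' \Gamma(1)\beta_\perp$ follows from the invertibility of $B(1)$ which in turn is implied by $\det B(z) \neq 0$ for $|z| \leq 1$. ■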


6.4  Deterministic  Terms  in  Cointegrated  Processes 

In  the  previous  section,  we  have  ignored  deterministic  terms  in  the  DGP. 
Clearly,  deterministic  terms  may  also  be  present  in  cointegrated  processes 
and  VECMs.  Actually,  from  the  discussion  of  the  random  walk  with  drift  it 
should  be  clear  that  deterministic  terms  in  a  VAR  process  with  unit  roots  may 
have  a  different  impact  than  in  a  stable  VAR.  For  example,  an  intercept  term 
in  a  random  walk  generates  a  linear  trend  in  the  mean  of  the  process,  whereas 
an  intercept  term  in  a  stable  AR  process  just  implies  a  constant  mean  value. 
To  explore  the  implications  of  the  deterministic  term,  the  following  model  is 
assumed: 

\[ y_t = \mu_t + x_t, \tag{6.4.1} \]

where $x_t$ is a zero mean VAR($p$) process with possibly cointegrated variables and $\mu_t$ stands for the deterministic term. For example, the deterministic term may just be a constant, $\mu_t = \mu_0$, or it may be a linear trend term, $\mu_t = \mu_0 + \mu_1 t$, where $\mu_0$ and $\mu_1$ are fixed $K$-dimensional parameter vectors. Other possible
deterministic  terms  that  may  be  included  are  seasonal  dummy  variables  or 
other  dummies  to  account  for  special  events.  The  advantage  of  setting  up  the 
process  in  the  form  (6.4.1)  by  adding  the  deterministic  part  to  the  zero  mean 
stochastic  part  is  that  the  mean  of  the  yt  variables  is  clearly  specified  by 
the  deterministic  term  and  does  not  need  to  be  derived  from  quantities  that 
involve  the  parameters  of  the  stochastic  part  in  addition.  The  disadvantage 
is  that  the  stochastic  part  xt  is  not  directly  observable  in  general.  Therefore, 
for  estimation  purposes,  for  instance,  we  have  to  rewrite  the  process  in  terms 
of the observable $y_t$'s. We will do so in the following for some cases of specific interest.

It is assumed that the DGP of $x_t$ can be represented as a VECM such as
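(6.3.6),

\[ \Delta x_t = \alpha\beta' x_{t-1} + \Gamma_1 \Delta x_{t-1} + \cdots + \Gamma_{p-1}\Delta x_{t-p+1} + u_t \]
\[ \phantom{\Delta x_t} = \Pi x_{t-1} + \Gamma_1 \Delta x_{t-1} + \cdots + \Gamma_{p-1}\Delta x_{t-p+1} + u_t. \tag{6.4.2} \]

Considering now the case of a constant deterministic term, $\mu_t = \mu_0$, we have $x_t = y_t - \mu_0$ so that $\Delta y_t = \Delta x_t$ and from (6.4.2) we get

\[ \Delta y_t = \alpha\beta'(y_{t-1} - \mu_0) + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \alpha\beta^{o\prime} \begin{bmatrix} y_{t-1} \\ 1 \end{bmatrix} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \Pi^o y^o_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \tag{6.4.3} \]

where $\beta^{o\prime} := [\beta' : \tau']$ with $\tau' := -\beta'\mu_0$ an $(r \times 1)$ vector,

\[ y^o_{t-1} := \begin{bmatrix} y_{t-1} \\ 1 \end{bmatrix}, \]

and $\Pi^o := [\Pi : \nu_0]$ is $(K \times (K+1))$ with $\nu_0 := -\Pi\mu_0 = \alpha\tau'$. Hence, if there is just a constant mean, it can be absorbed into the cointegration relations. In other words, the constant mean becomes an intercept term in the cointegration relations. Of course, the model can also be written with an overall intercept term as

\[ \Delta y_t = \nu_0 + \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \nu_0 + \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t. \tag{6.4.4} \]

Here $\nu_0$ cannot be an arbitrary $(K \times 1)$ vector but has to satisfy the indicated restrictions ($\nu_0 = \alpha\tau'$) in order to ensure that the intercept term in this model does not generate a linear trend in the mean of the $y_t$ variables. By specifying the deterministic term in additive form as in (6.4.1), the properties of the mean of $y_t$ are easy to see.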

A process with a linear trend in the mean, $\mu_t = \mu_0 + \mu_1 t$, is another case of practical importance. Using $x_t = y_t - \mu_0 - \mu_1 t$, $\Delta x_t = \Delta y_t - \mu_1$, and (6.4.2), gives

\[ \Delta y_t - \mu_1 = \alpha\beta'(y_{t-1} - \mu_0 - \mu_1(t - 1)) + \Gamma_1(\Delta y_{t-1} - \mu_1) + \cdots + \Gamma_{p-1}(\Delta y_{t-p+1} - \mu_1) + u_t \tag{6.4.5} \]

or, collecting deterministic terms,

\[ \Delta y_t = \nu + \alpha[\beta' : \eta'] \begin{bmatrix} y_{t-1} \\ t - 1 \end{bmatrix} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \nu + \Pi^+ y^+_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \tag{6.4.6} \]

where $\nu := -\Pi\mu_0 + (I_K - \Gamma_1 - \cdots - \Gamma_{p-1})\mu_1$, $\eta' := -\beta'\mu_1$, $\Pi^+ := \alpha[\beta' : \eta']$ is a $(K \times (K+1))$ matrix and

\[ y^+_{t-1} := \begin{bmatrix} y_{t-1} \\ t - 1 \end{bmatrix}. \]

Now the general intercept term $\nu$ is in fact unrestricted and can take on any value from $\mathbb{R}^K$, depending of course on $\mu_0$, $\mu_1$, and the other parameters. In contrast, the trend term can be absorbed into the cointegration relations. Writing the model with unrestricted linear trend term in the form

\[ \Delta y_t = \nu_0 + \nu_1 t + \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \]

the model is actually in principle capable of generating quadratic trends in the means of the variables.

It is also possible that the trend slope parameter $\mu_1$ is orthogonal to the cointegration matrix so that $\beta'\mu_1 = 0$ and, hence, $\eta = 0$ and the trend term disappears from the cointegration relations. This situation can also occur if $\mu_1 \neq 0$ and the variables actually have linear trends in their means. The linear trends will then be generated via the intercept term $\nu$. The resulting model,

\[ \Delta y_t = \nu + \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t \]
\[ \phantom{\Delta y_t} = \nu + \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \tag{6.4.7} \]

with unrestricted intercept term $\nu$ will be of some importance later on. It represents a situation where a linear trend appears in the variables but not in the cointegration relations. Notice, however, that in this situation the cointegration rank must be smaller than $K$. If the process has cointegrating rank $K$, it is stable and, hence, it cannot generate a linear trend when just an intercept is included in the model. Formally, a "cointegrating matrix" $\beta$ of rank $K$ is nonsingular so that $\beta'\mu_1$ cannot be zero if $\mu_1$ is nonzero.
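The way the deterministic terms enter (6.4.3), (6.4.6) and (6.4.7) can be traced with a small numerical example. The sketch below assumes NumPy; the parameter values and the helper name `trend_terms` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Illustrative bivariate VECM parameters (K = 2, r = 1), not taken from the text.
alpha = np.array([[-1.0], [0.0]])
beta = np.array([[1.0], [-1.0]])
Pi = alpha @ beta.T
mu0 = np.array([2.0, 0.5])

# Constant mean only: nu_0 = -Pi mu_0 = alpha tau' with tau' = -beta' mu_0.
tau = -beta.T @ mu0
nu0 = -Pi @ mu0
print(np.allclose(nu0, alpha @ tau))

def trend_terms(mu1, Gammas=()):
    """Return (nu, eta) of (6.4.6) for trend slope mu1 and optional Gamma_i matrices."""
    K = Pi.shape[0]
    nu = -Pi @ mu0 + (np.eye(K) - sum(Gammas)) @ mu1
    eta = -beta.T @ mu1          # trend coefficient inside the cointegration relation
    return nu, eta

# Slope orthogonal to beta: the trend drops out of the cointegration relation.
nu_a, eta_a = trend_terms(np.array([0.1, 0.1]))
# Slope not orthogonal to beta: a trend remains in the cointegration relation.
nu_b, eta_b = trend_terms(np.array([0.1, 0.3]))
print(nu_a, eta_a)
print(nu_b, eta_b)
```

In the first case the model reduces to the form (6.4.7), where the linear trends in the variables are generated by the intercept alone.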

It may also be worth noting that the specification of the deterministic component in additive form as in (6.4.1) has the additional advantage that the Beveridge-Nelson representation of $y_t$ is obtained by adding the deterministic term to the Beveridge-Nelson representation of $x_t$. Thus, a suitable generalization of the Granger representation theorem (Proposition 6.1) is readily
available. 


6.5  Forecasting  Integrated  and  Cointegrated  Variables 

If  forecasting  is  the  objective,  the  VAR  form  of  a  process  is  quite  convenient. 
Because forecasting the deterministic part is trivial, a purely stochastic process will be considered initially. For a VAR($p$) process,

\[ y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \tag{6.5.1} \]

the optimal $h$-step forecast with minimal MSE is given by the conditional expectation, provided that expectation exists, even if $\det(I_K - A_1 z - \cdots - A_p z^p)$ has roots on the unit circle. In the proof of the optimality of the conditional expectation in Section 2.2.2, we have not used the stationarity and stability of the system. Thus, assuming that $u_t$ is independent white noise, the optimal $h$-step forecast at origin $t$ is

\[ y_t(h) = A_1 y_t(h - 1) + \cdots + A_p y_t(h - p), \tag{6.5.2} \]

where $y_t(j) := y_{t+j}$ for $j \leq 0$, just as in the stationary, stable case.

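Also the forecast errors are of the same form as in the stable case. To see this, we write the process (6.5.1) in VAR(1) form as

\[ Y_t = \mathbf{A} Y_{t-1} + U_t, \tag{6.5.3} \]

where

\[ Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix} \ (Kp \times 1), \quad \mathbf{A} := \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I_K & 0 & \cdots & 0 & 0 \\ 0 & I_K & & 0 & 0 \\ \vdots & & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I_K & 0 \end{bmatrix} \ (Kp \times Kp), \quad \text{and} \quad U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix} \ (Kp \times 1). \]

If $u_t$ is independent white noise, the optimal $h$-step forecast of $Y_t$ is

\[ Y_t(h) = \mathbf{A} Y_t(h - 1) = \mathbf{A}^h Y_t. \]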

Moreover,

\[ Y_{t+h} = \mathbf{A} Y_{t+h-1} + U_{t+h} = \mathbf{A}^h Y_t + U_{t+h} + \mathbf{A} U_{t+h-1} + \cdots + \mathbf{A}^{h-1} U_{t+1}. \]

Hence, the forecast error for the process $Y_t$ is

\[ Y_{t+h} - Y_t(h) = U_{t+h} + \mathbf{A} U_{t+h-1} + \cdots + \mathbf{A}^{h-1} U_{t+1}. \]

Premultiplying by the $(K \times Kp)$ matrix $J := [I_K : 0 : \cdots : 0]$ gives

\[ y_{t+h} - y_t(h) = J U_{t+h} + J\mathbf{A}J'J U_{t+h-1} + \cdots + J\mathbf{A}^{h-1}J'J U_{t+1} \]
\[ \phantom{y_{t+h} - y_t(h)} = u_{t+h} + \Phi_1 u_{t+h-1} + \cdots + \Phi_{h-1} u_{t+1}, \tag{6.5.4} \]

where $J'J U_t = U_t$ and $\Phi_i = J\mathbf{A}^i J'$ have been used. Thus, the form of the forecast error is exactly the same as in the stable case and the forecast is easily seen to be unbiased, that is,

\[ E[y_{t+h} - y_t(h)] = 0. \]


Furthermore, the $\Phi_i$'s may be obtained from the $A_i$'s by the recursions

\[ \Phi_i = \sum_{j=1}^{i} \Phi_{i-j} A_j, \quad i = 1, 2, \ldots, \tag{6.5.5} \]

with $\Phi_0 = I_K$ and $A_j = 0$ for $j > p$, just as in Chapter 2. Also the forecast MSE matrix becomes

\[ \Sigma_y(h) = \sum_{i=0}^{h-1} \Phi_i \Sigma_u \Phi_i' \tag{6.5.6} \]

as in the stable case. Yet there is a very important difference. In the stable case, the $\Phi_i$'s converge to zero as $i \to \infty$ and $\Sigma_y(h)$ converges to the covariance matrix of $y_t$ as $h \to \infty$. This result was obtained because the eigenvalues of $\mathbf{A}$ have modulus less than one in the stable case. Hence, $\Phi_i = J\mathbf{A}^i J' \to 0$ as $i \to \infty$. Because the eigenvalues of $\mathbf{A}$ are just the reciprocals of the roots of the determinantal polynomial $\det(I_K - A_1 z - \cdots - A_p z^p)$, the $\Phi_i$'s do not converge to zero in the presently considered unstable case where one or more of the eigenvalues of $\mathbf{A}$ are 1. Consequently, some elements of the forecast MSE matrix $\Sigma_y(h)$ will approach infinity as $h \to \infty$. In other words, the forecast
MSEs  will  be  unbounded  and  the  forecast  uncertainty  may  become  extremely 
large  as  we  make  forecasts  for  the  distant  future,  even  if  the  structure  of  the 
process  does  not  change. 
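The recursion (6.5.5) and the MSE formula (6.5.6) translate directly into code. The following is a minimal sketch, assuming NumPy; the function names `ma_matrices` and `forecast_mse` are hypothetical, and the unit root VAR(1) coefficient matrix and residual covariance matrix are illustrative values.

```python
import numpy as np

def ma_matrices(A, h):
    """Phi_0, ..., Phi_{h-1} from Phi_i = sum_{j=1}^{i} Phi_{i-j} A_j, A_j = 0 for j > p."""
    K, p = A[0].shape[0], len(A)
    Phi = [np.eye(K)]
    for i in range(1, h):
        Phi.append(sum(Phi[i - j] @ A[j - 1] for j in range(1, min(i, p) + 1)))
    return Phi

def forecast_mse(A, Sigma_u, h):
    """Sigma_y(h) = sum_{i=0}^{h-1} Phi_i Sigma_u Phi_i'."""
    return sum(P @ Sigma_u @ P.T for P in ma_matrices(A, h))

A_unit_root = [np.array([[0.0, 1.0], [0.0, 1.0]])]   # has an eigenvalue equal to 1
Sigma_u = np.array([[1.0, 0.3], [0.3, 2.0]])

for h in (1, 5, 25, 125):
    # The diagonal elements (forecast error variances) keep growing with h.
    print(h, np.diag(forecast_mse(A_unit_root, Sigma_u, h)))
```

The same functions apply unchanged to a stable VAR, where the printed variances would level off instead of growing.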

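To illustrate this point, consider the following bivariate VAR(1) example process with cointegrating rank 1:

\[ \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}. \tag{6.5.7} \]

The corresponding VECM representation is

\[ \Delta y_t = -\begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix} y_{t-1} + u_t = \begin{bmatrix} -1 \\ 0 \end{bmatrix} [1, -1] y_{t-1} + u_t, \]

that is,

\[ \alpha = \begin{bmatrix} -1 \\ 0 \end{bmatrix} \quad \text{and} \quad \beta' = [1, -1]. \]

For this process, it is easily seen that $\Phi_0 = I_2$ and

\[ \Phi_j = A_1^j = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad j = 1, 2, \ldots, \]

which implies

\[ \Sigma_y(h) = \sum_{j=0}^{h-1} \Phi_j \Sigma_u \Phi_j' = \Sigma_u + (h - 1) \begin{bmatrix} \sigma_2^2 & \sigma_2^2 \\ \sigma_2^2 & \sigma_2^2 \end{bmatrix}, \quad h = 1, 2, \ldots, \]

where $\sigma_k^2$ is the variance of $u_{kt}$. Moreover, the conditional expectations are $y_{k,t}(h) = y_{2,t}$ ($k = 1, 2$). Hence, the forecast intervals are

\[ \left[ y_{2,t} - z_{(\alpha/2)} \sqrt{\sigma_k^2 + (h - 1)\sigma_2^2}, \ y_{2,t} + z_{(\alpha/2)} \sqrt{\sigma_k^2 + (h - 1)\sigma_2^2} \right], \quad k = 1, 2, \]

where $z_{(\alpha/2)}$ is the $(1 - \frac{\alpha}{2})100$ percentage point of the standard normal distribution. It is easy to see that the length of this interval is unbounded for $h \to \infty$.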

If there are cointegrated variables, some linear combinations can be forecasted with bounded forecast error variance, however. To see this, multiply (6.5.7) by

\[ \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}. \]

Thereby we get

\[ \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} y_t = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} y_{t-1} + \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} u_t, \]

which implies that the cointegration relation $z_t := y_{1t} - y_{2t} = u_{1t} - u_{2t}$ is zero mean white noise. Thus, the forecast intervals for $z_t$ for any forecast horizon $h$ are of constant length,

\[ [z_t(h) - z_{(\alpha/2)}\sigma_z(h), \ z_t(h) + z_{(\alpha/2)}\sigma_z(h)] = [-z_{(\alpha/2)}\sigma_z, \ z_{(\alpha/2)}\sigma_z], \]

where $\sigma_z^2 := \mathrm{Var}(u_{1t}) + \mathrm{Var}(u_{2t}) - 2\mathrm{Cov}(u_{1t}, u_{2t})$ is the variance of $z_t$ and $z_t(h) = 0$ for $h \geq 1$ has been used.
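The contrast between the unbounded intervals for the individual variables and the constant interval for the cointegration relation in example (6.5.7) can be made concrete as follows. This is a minimal sketch, assuming NumPy and the standard library class `statistics.NormalDist`; the residual covariance matrix and the 95% level are made-up choices.

```python
import numpy as np
from statistics import NormalDist

Sigma_u = np.array([[1.0, 0.3], [0.3, 2.0]])     # hypothetical residual covariance
Phi = np.array([[0.0, 1.0], [0.0, 1.0]])         # Phi_j for j >= 1 in example (6.5.7)
b = np.array([1.0, -1.0])                        # cointegration vector: z_t = y_1t - y_2t
z_crit = NormalDist().inv_cdf(0.975)             # critical value for a 95% interval

def mse(h):
    """Sigma_y(h) = Sigma_u + (h - 1) Phi Sigma_u Phi' for this example."""
    return Sigma_u + (h - 1) * Phi @ Sigma_u @ Phi.T

for h in (1, 10, 100):
    S = mse(h)
    half_y = z_crit * np.sqrt(np.diag(S))        # interval half-lengths for y_1 and y_2
    half_z = z_crit * np.sqrt(b @ S @ b)         # interval half-length for z
    print(h, half_y.round(2), round(float(half_z), 2))
```

The half-length for $z_t$ stays the same at every horizon because $\Phi_j'\beta = 0$ for $j \geq 1$ in this example, so only $\Sigma_u$ contributes to the forecast error variance of the cointegration relation.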

If deterministic terms are present, we may use the foregoing formulas for the mean-adjusted variables and then add the deterministic terms for the forecast period to the mean-adjusted forecasts. More precisely, if $y_t = \mu_t + x_t$, where $\mu_t$ is the deterministic term and $x_t$ is the stochastic part, a forecast for $y_{t+h}$ is obtained from a forecast $x_t(h)$ for $x_{t+h}$ by simply adding $\mu_{t+h}$, $y_t(h) = \mu_{t+h} + x_t(h)$. By the very nature of a deterministic term, $\mu_{t+h}$ is known, of course.

In practice, the parameters $A_1, \ldots, A_p$, $\Sigma_u$, and those of the deterministic part are usually unknown. The consequences of replacing them by estimators will be discussed in Chapter 7.


6.6  Causality  Analysis 


From the discussion in the previous subsection, it follows easily that the restrictions characterizing Granger-noncausality are exactly the same as in the stable case. More precisely, suppose that the vector $y_t$ in (6.5.1) is partitioned in $M$- and $(K - M)$-dimensional subvectors $z_t$ and $x_t$,

\[ y_t = \begin{bmatrix} z_t \\ x_t \end{bmatrix} \quad \text{and} \quad A_i = \begin{bmatrix} A_{11,i} & A_{12,i} \\ A_{21,i} & A_{22,i} \end{bmatrix}, \quad i = 1, \ldots, p, \]

where the $A_i$ are partitioned in accordance with the partitioning of $y_t$. Then $x_t$ does not Granger-cause $z_t$ if and only if

\[ A_{12,i} = 0, \quad i = 1, \ldots, p. \tag{6.6.1} \]

In turn, $z_t$ does not Granger-cause $x_t$ if and only if $A_{21,i} = 0$ for $i = 1, \ldots, p$.
It  is  also  easy  to  derive  the  corresponding  restrictions  for  the  VECM, 


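\[ \begin{bmatrix} \Delta z_t \\ \Delta x_t \end{bmatrix} = \begin{bmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{bmatrix} \begin{bmatrix} z_{t-1} \\ x_{t-1} \end{bmatrix} + \sum_{i=1}^{p-1} \begin{bmatrix} \Gamma_{11,i} & \Gamma_{12,i} \\ \Gamma_{21,i} & \Gamma_{22,i} \end{bmatrix} \begin{bmatrix} \Delta z_{t-i} \\ \Delta x_{t-i} \end{bmatrix} + u_t, \]

where all matrices are partitioned in line with $y_t$. From (6.3.6) it follows immediately that the restrictions in (6.6.1) can be written equivalently as

\[ \Pi_{12} = 0 \quad \text{and} \quad \Gamma_{12,i} = 0 \ \text{for} \ i = 1, \ldots, p - 1. \tag{6.6.2} \]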


In  other  words,  in  order  to  check  Granger-causality,  we  just  have  to  test  a  set 
of  linear  hypotheses.  It  will  be  seen  in  the  next  chapter  that  in  the  case  of 
cointegrated  processes,  testing  these  restrictions  is  not  as  straightforward  as 
for  stationary  processes. 
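The equivalence of (6.6.1) and (6.6.2) is a purely algebraic consequence of the definitions of $\Pi$ and the $\Gamma_i$, which the following sketch illustrates. It assumes NumPy; the dimensions and the randomly drawn coefficients are hypothetical, and nothing here addresses the testing problem mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, p = 3, 1, 3                    # z_t is the first M component(s), x_t the rest
A = [rng.normal(scale=0.3, size=(K, K)) for _ in range(p)]
for Ai in A:
    Ai[:M, M:] = 0.0                 # impose A_{12,i} = 0: x_t does not Granger-cause z_t

# VECM parameters implied by the restricted VAR coefficients.
Pi = -(np.eye(K) - sum(A))
Gammas = [-sum(A[i:]) for i in range(1, p)]

# Upper right blocks Pi_12 and Gamma_{12,i}, i = 1, ..., p-1.
upper_right = [Pi[:M, M:]] + [G[:M, M:] for G in Gammas]
print(all(np.allclose(B, 0.0) for B in upper_right))
```

The zero blocks carry over because the identity matrix has a zero upper right block and the $\Gamma_i$ are sums of the $A_j$.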

Also  restrictions  for  multi-step  causality  and  instantaneous  causality  can 
be  placed  on  the  VAR  coefficients  and  the  residual  covariance  matrix  in  the 
same  way  as  in  Chapter  2.  Especially  for  the  former  restrictions,  constructing 
valid  asymptotic  tests  is  not  straightforward,  however. 


6.7  Impulse  Response  Analysis 


Integrated and cointegrated systems must be interpreted cautiously. As mentioned in Section 6.3, in cointegrated systems the term $\beta' y_t$ is usually thought of as representing the long-run equilibrium relations between the variables. Suppose there is just one such relation, say

\[ \beta_1 y_{1t} + \cdots + \beta_K y_{Kt} = 0, \]

or, if $\beta_1 \neq 0$,

\[ y_{1t} = -\frac{\beta_2}{\beta_1} y_{2t} - \cdots - \frac{\beta_K}{\beta_1} y_{Kt}. \]

It is tempting to argue that the long-run effect of a unit increase in $y_2$ will be a change of size $-\beta_2/\beta_1$ in $y_1$. This, however, ignores all the other relations between the variables which are summarized in a VAR($p$) model or the corresponding VECM. A one-time unit innovation in $y_2$ may affect various other variables which also have an impact on $y_1$. Therefore, the long-run effect of a $y_2$-innovation on $y_1$ may be quite different from $-\beta_2/\beta_1$. The impulse responses may give a better picture of the relations between the variables.

In Chapter 2, Section 2.3.2, the impulse responses of stationary, stable VAR($p$) processes were shown to be the coefficients of specific MA representations. An unstable, integrated or cointegrated VAR($p$) process does not possess valid MA representations of the types discussed in Chapter 2. Yet the $\Phi_i$ and $\Theta_i$ matrices can be computed as in Section 2.3.2. For the $\Phi_i$'s we have seen this in Section 6.5 and, from the discussion in that section, it is easy to see that the elements of the $\Phi_i = (\phi_{jk,i})$ matrices may represent impulse responses just as in the stable case. More precisely, $\phi_{jk,i}$ represents the response of variable $j$ to a unit forecast error in variable $k$, $i$ periods ago, if the system reflects the actual responses to forecast errors. Recall that in stable processes the responses taper off to zero as $i \to \infty$. This property does not necessarily hold in unstable systems where the effect of a one-time impulse may not die out asymptotically.

In Section 2.3, we have also considered accumulated impulse responses, responses to orthogonalized residuals and forecast error variance decompositions. These tools for structural analysis are all available for unstable systems as well, using precisely the same formulas as in Chapter 2. The only quantities that cannot be computed in general are the total "long-run effects" or total multipliers $\Psi_\infty$ and $\Xi_\infty$ because they may not be finite.

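To illustrate impulse response analysis of cointegrated systems, we consider the following VECM:

\[
\begin{aligned}
\begin{bmatrix} \Delta R_t \\ \Delta Dp_t \end{bmatrix}
={}& \begin{bmatrix} -0.07 \\ 0.17 \end{bmatrix} (R_{t-1} - 4 Dp_{t-1})
+ \begin{bmatrix} 0.24 & -0.08 \\ 0 & -0.31 \end{bmatrix}
\begin{bmatrix} \Delta R_{t-1} \\ \Delta Dp_{t-1} \end{bmatrix} \\
&+ \begin{bmatrix} 0 & -0.13 \\ 0 & -0.37 \end{bmatrix}
\begin{bmatrix} \Delta R_{t-2} \\ \Delta Dp_{t-2} \end{bmatrix}
+ \begin{bmatrix} 0.20 & -0.06 \\ 0 & -0.34 \end{bmatrix}
\begin{bmatrix} \Delta R_{t-3} \\ \Delta Dp_{t-3} \end{bmatrix}
+ \begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix}.
\end{aligned}
\tag{6.7.1}
\]

The residual covariance matrix is

\[
\Sigma_u = \begin{bmatrix} 2.61 & -0.15 \\ -0.15 & 2.31 \end{bmatrix} \times 10^{-5}
\]

and the corresponding correlation matrix is

\[
R_u = \begin{bmatrix} 1 & -0.06 \\ -0.06 & 1 \end{bmatrix}.
\]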

This model is from Lütkepohl (2004, Eq. (3.41)). The variables are a long-term interest rate ($R_t$) and the quarterly inflation rate ($Dp_t$). The coefficients are estimated from quarterly German data. Deterministic terms have been deleted because they are not important for the present analysis.

In contrast to the inflation/interest rate example system considered in Chapter 2, the two variables in the present system are $I(1)$. The cointegration relation, $R_t - 4Dp_t$, is just the real interest rate because $4Dp_t$ is the annual inflation rate and $R_t$ is an annual nominal interest rate. Thus, in the present model the real interest rate is stationary. This relation is sometimes called the Fisher effect. The zero restrictions have been determined by a subset modelling algorithm. The residual covariance matrix is almost diagonal. Therefore, forecast error impulse responses should be similar to orthogonalized impulse responses, except for the scaling. The two types of impulse responses are shown in Figures 6.4 and 6.5, respectively. Indeed, the shape of corresponding impulse responses in the two figures is quite similar. A remarkable feature of the impulse responses is that they do not die out to zero when the time span after the impulse increases but approach some nonzero value. Clearly, this reflects the nonstationarity of the system where a one-time impulse can have permanent effects.

Fig. 6.4. Forecast error impulse responses of VECM (6.7.1).
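
The computations behind such impulse responses can be sketched in a few lines of Python. The coefficients below are copied from (6.7.1). The conversion to a levels VAR(4) assumes the usual relations $A_1 = I_K + \Pi + \Gamma_1$, $A_i = \Gamma_i - \Gamma_{i-1}$ and $A_p = -\Gamma_{p-1}$, and the $\Phi_i$ follow from the recursion $\Phi_0 = I_K$, $\Phi_i = \sum_{j=1}^{\min(i,p)} \Phi_{i-j} A_j$ of Section 2.3.2. The function name `forecast_error_irf` and the horizon are illustrative choices, not part of the original analysis.

```python
import numpy as np

# VECM (6.7.1): Pi = alpha beta', beta' = (1, -4)
alpha = np.array([[-0.07], [0.17]])
beta = np.array([[1.0], [-4.0]])
Pi = alpha @ beta.T
Gammas = [np.array([[0.24, -0.08], [0.0, -0.31]]),
          np.array([[0.00, -0.13], [0.0, -0.37]]),
          np.array([[0.20, -0.06], [0.0, -0.34]])]
Sigma_u = 1e-5 * np.array([[2.61, -0.15], [-0.15, 2.31]])

# Levels VAR(4) coefficient matrices implied by the VECM
A = [np.eye(2) + Pi + Gammas[0],
     Gammas[1] - Gammas[0],
     Gammas[2] - Gammas[1],
     -Gammas[2]]

def forecast_error_irf(A, horizon):
    """Phi_0, ..., Phi_horizon via Phi_i = sum_{j=1}^{min(i,p)} Phi_{i-j} A_j."""
    K = A[0].shape[0]
    Phi = [np.eye(K)]
    for i in range(1, horizon + 1):
        Phi.append(sum(Phi[i - j] @ A[j - 1]
                       for j in range(1, min(i, len(A)) + 1)))
    return Phi

Phi = forecast_error_irf(A, horizon=40)

# Orthogonalized responses Theta_i = Phi_i P, with P the lower triangular
# Cholesky factor of Sigma_u (P P' = Sigma_u)
P = np.linalg.cholesky(Sigma_u)
Theta = [Ph @ P for Ph in Phi]

print(np.round(Phi[40], 3))   # responses 40 quarters after a unit forecast error
```

Plotting the elements of `Phi` and `Theta` against the horizon gives pictures of the kind shown in Figures 6.4 and 6.5, although a sketch like this cannot replace the estimation and inference steps behind the published figures.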

Using the orthogonalized impulse responses, it is also possible to compute forecast error variance decompositions based on the same formulas as in Chapter 2, Section 2.3.3. For the example system, they are shown in Figure 6.6. They look similar to forecast error variance decompositions from a stationary VAR process. Of course, there is no reason why they should look different from those in the stationary case.
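
A forecast error variance decomposition of this kind may be sketched as follows. The code assumes the formula of Section 2.3.3, $\omega_{jk,h} = \sum_{i=0}^{h-1} \theta_{jk,i}^2 \big/ \sum_{i=0}^{h-1} \sum_{k=1}^{K} \theta_{jk,i}^2$, with $\Theta_i = \Phi_i P$ and $P$ the Cholesky factor of $\Sigma_u$. The levels VAR matrices are those implied by (6.7.1), and the function name `fevd` is a hypothetical choice.

```python
import numpy as np

def fevd(A, Sigma_u, h):
    """Forecast error variance decomposition at horizon h.

    omega[j, k] is the share of the h-step forecast error variance of
    variable j accounted for by orthogonalized innovations in variable k."""
    K = Sigma_u.shape[0]
    P = np.linalg.cholesky(Sigma_u)          # lower triangular, P P' = Sigma_u
    Phi = [np.eye(K)]
    for i in range(1, h):                    # Phi_0, ..., Phi_{h-1}
        Phi.append(sum(Phi[i - j] @ A[j - 1]
                       for j in range(1, min(i, len(A)) + 1)))
    contrib = sum((Ph @ P) ** 2 for Ph in Phi)   # sum over i of theta_{jk,i}^2
    return contrib / contrib.sum(axis=1, keepdims=True)

# Levels VAR(4) implied by VECM (6.7.1)
Pi = np.array([[-0.07], [0.17]]) @ np.array([[1.0, -4.0]])
G1 = np.array([[0.24, -0.08], [0.0, -0.31]])
G2 = np.array([[0.00, -0.13], [0.0, -0.37]])
G3 = np.array([[0.20, -0.06], [0.0, -0.34]])
A = [np.eye(2) + Pi + G1, G2 - G1, G3 - G2, -G3]
Sigma_u = 1e-5 * np.array([[2.61, -0.15], [-0.15, 2.31]])

omega_1 = fevd(A, Sigma_u, h=1)
omega_20 = fevd(A, Sigma_u, h=20)
print(np.round(omega_20, 3))
```

Because the ordering of the variables determines the Cholesky factor, the decomposition depends on the ordering whenever $\Sigma_u$ is not diagonal. With the nearly diagonal $\Sigma_u$ of the example, this dependence should be weak.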

As discussed in Chapter 2, interpreting the forecast error and orthogonalized impulse responses used here is often problematic if there is significant correlation between the components of the residuals $u_t$. It will be discussed in Chapter 9 how identifying restrictions for impulse responses can be imposed in the VECM framework.

Fig. 6.5. Orthogonalized impulse responses of VECM (6.7.1).


6.8  Exercises 


Problem 6.1
Consider the process

\[
y_t = \begin{bmatrix} 1 & 0 \\ 0 & \psi \end{bmatrix} y_{t-1} + u_t
\]

with residual covariance matrix

\[
\Sigma_u = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}.
\]

(a) What is the cointegrating rank of the process?

(b) Write the process in VECM form.


Problem 6.2

Determine the roots of the reverse characteristic polynomial and, if applicable, the cointegrating rank of the process

\[
y_t = \begin{bmatrix} 1.1 & -0.2 \\ -0.2 & 1.4 \end{bmatrix} y_{t-1} + u_t.
\]

Can you write the process in VECM form?

Fig. 6.6. Forecast error variance decomposition of VECM (6.7.1).


Problem 6.3

What is the maximum possible cointegrating rank of a three-dimensional process $y_t = (y_{1t}, y_{2t}, y_{3t})'$,

(a) if $y_{1t}$, $y_{2t}$ are $I(0)$ and $y_{3t}$ is $I(1)$?

(b) if $y_{1t}$, $y_{2t}$, and $y_{3t}$ are $I(1)$ and $y_{1t}$ and $y_{2t}$ are not cointegrated in a bivariate system?

(c) if $y_{1t}$, $y_{2t}$, and $y_{3t}$ are $I(1)$ and $(y_{1t}, y_{2t})'$ and $(y_{2t}, y_{3t})'$ are not cointegrated as bivariate systems?

Problem 6.4

Find the Beveridge-Nelson decomposition associated with the VECM

\[
\Delta y_t = \alpha\beta' y_{t-1} + u_t,
\]

(a) if all initial values are zero ($y_t = u_t = 0$ for $t \le 0$),

(b) if $y_0$ is nonzero.

Problem 6.5

Derive the VECM form of $y_t$ if the deterministic term is $\mu_t = \mu_0 + \delta I_{(t > T_B)}$, where $I_{(t > T_B)}$ is a shift dummy variable which is zero up to time $T_B$ and then jumps to one and $\delta$ is the associated $(K \times 1)$ parameter vector.

Problem 6.6

Consider the quarterly process $y_t = \mu_t + x_t$, where $x_t$ has a VECM representation as in (6.4.2) and

\[
\mu_t = \mu_0 + \mu_1 t + \delta_1 s_{1t} + \delta_2 s_{2t} + \delta_3 s_{3t}.
\]

Here $\mu_0$, $\mu_1$, $\delta_1$, $\delta_2$, and $\delta_3$ are $K$-dimensional parameter vectors and the $s_{it}$'s ($i = 1, 2, 3$) are seasonal dummy variables. Determine the VECM representation of $y_t$.

Problem 6.7
Consider the VECM

\[
\Delta y_t = \begin{bmatrix} -0.1 \\ 0.1 \end{bmatrix} (1, \ \ldots)\, y_{t-1} + u_t.
\]

(a) Rewrite the process in VAR form.

(b) Determine the roots of the reverse characteristic polynomial.

(c) Determine forecast intervals for the two variables for forecast horizon $h$.

(d) Has a forecast error impulse in $y_{1t}$ a permanent impact on $y_{2t}$? Has a forecast error impulse in $y_{2t}$ a permanent impact on $y_{1t}$?


Estimation  of  Vector  Error  Correction  Models 


In this chapter, estimation of VECMs is discussed. The asymptotic properties of estimators for nonstationary models differ in important ways from those of stationary processes. Therefore, in the first section, a simple special case model with no lagged differences and no deterministic terms is considered and different estimation methods for the parameters of the error correction term are treated. For this simple case, the asymptotic properties can be derived with a reasonable amount of effort and the difference to estimation in stationary models can be seen fairly easily. Therefore it is useful to treat this case in some detail. The results can then be extended to more general VECMs which are considered in Section 7.2. In Section 7.3, Bayesian estimation including the Minnesota or Litterman prior for integrated processes is discussed and forecasting and structural analysis based on estimated processes are considered in Sections 7.4-7.6.


7.1  Estimation  of  a  Simple  Special  Case  VECM 

In this section, a simple VECM without lagged differences and deterministic terms is considered. More precisely, the model of interest is

\[
\Delta y_t = \Pi y_{t-1} + u_t = \alpha\beta' y_{t-1} + u_t, \quad t = 1, 2, \dots, \tag{7.1.1}
\]

where $y_t$ is $K$-dimensional, $\Pi$ is a $(K \times K)$ matrix of rank $r$, $0 < r < K$, $\alpha$ and $\beta$ are $(K \times r)$ with rank $r$, and $u_t$ is $K$-dimensional white noise with mean zero and nonsingular covariance matrix $\Sigma_u$. For simplicity, we assume that $u_t$ is standard white noise so that certain limiting results hold which will be discussed and used in the following. For the time being, the initial vector $y_0$ is arbitrary with some fixed distribution. We also assume that $y_t$ is an $I(1)$ vector so that we know from Section 6.3 that the $((K - r) \times (K - r))$ matrix

\[
\alpha_\perp' \beta_\perp
\]

is invertible (see Eq. (6.3.12)). Here $\alpha_\perp$ and $\beta_\perp$ are, as usual, orthogonal complements of $\alpha$ and $\beta$, respectively.

The cointegration rank $r$ is assumed to be known and it is strictly between 0 and $K$. For $r = 0$, $\Delta y_t$ is stable and for $r = K$, $y_t$ is stable. For the present purposes, these two boundary cases are of limited interest because they can be treated in the stationary framework considered in Part I. If $r$ is not known, however, it may be of interest to consider the case $r = 0$. The matrix $\Pi$ is then zero, of course. We will comment on this case at the end of this section.

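We will discuss different estimators of the matrix $\Pi$, assuming that a sample $y_1, \dots, y_T$ and a presample vector $y_0$ are available. Our first estimator is the unrestricted LS estimator,

\[
\hat\Pi = \left( \sum_{t=1}^{T} \Delta y_t y_{t-1}' \right) \left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1}. \tag{7.1.2}
\]

Substituting $\Pi y_{t-1} + u_t$ for $\Delta y_t$ gives

\[
\hat\Pi - \Pi = \left( \sum_{t=1}^{T} u_t y_{t-1}' \right) \left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1}. \tag{7.1.3}
\]

To derive the asymptotic distribution of this quantity, we multiply from the left with the $(K \times K)$ matrix

\[
Q := \begin{bmatrix} \beta' \\ \alpha_\perp' \end{bmatrix}
\]

and from the right by

\[
Q^{-1} = \left[ \alpha(\beta'\alpha)^{-1} : \beta_\perp(\alpha_\perp'\beta_\perp)^{-1} \right],
\]

which yields

\[
\begin{aligned}
Q(\hat\Pi - \Pi)Q^{-1}
&= Q \left( \sum_{t=1}^{T} u_t y_{t-1}' \right) Q' \, Q^{-1\prime} \left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1} Q^{-1} \\
&= \left( \sum_{t=1}^{T} v_t z_{t-1}' \right) \left( \sum_{t=1}^{T} z_{t-1} z_{t-1}' \right)^{-1},
\end{aligned}
\tag{7.1.4}
\]

where $v_t := Q u_t$ and $z_t := Q y_t$. Notice that invertibility of $\alpha_\perp'\beta_\perp$ follows from our assumption of an $I(1)$ system, as mentioned earlier, and it implies that the inverse of $Q$ exists because

\[
Q[\beta : \beta_\perp] = \begin{bmatrix} \beta'\beta & 0 \\ \alpha_\perp'\beta & \alpha_\perp'\beta_\perp \end{bmatrix}
\]

is invertible if $\alpha_\perp'\beta_\perp$ is nonsingular. Hence, $Q$ must be invertible and, thus, $\beta'\alpha$ is also nonsingular.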
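
The role of the transformation matrix $Q$ can be made concrete with a small numerical check. The Python sketch below uses hypothetical values of $\alpha$ and $\beta$ for $K = 3$ and $r = 1$, constructs orthogonal complements from a singular value decomposition (one of many valid choices of $\alpha_\perp$ and $\beta_\perp$), and verifies the stated expression for $Q^{-1}$. It also evaluates $Q\Pi Q^{-1}$, whose only nonzero block is $\beta'\alpha$.

```python
import numpy as np

def orth_complement(M):
    """Orthonormal basis of the orthogonal complement of the column space of M.

    Assumes M has full column rank; uses the left singular vectors."""
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    return U[:, M.shape[1]:]

# Hypothetical loading and cointegration matrices (K = 3, r = 1)
alpha = np.array([[-0.5], [0.2], [0.1]])
beta = np.array([[1.0], [-1.0], [0.5]])
Pi = alpha @ beta.T

alpha_perp = orth_complement(alpha)     # (3 x 2), alpha' alpha_perp = 0
beta_perp = orth_complement(beta)       # (3 x 2), beta' beta_perp = 0

Q = np.vstack([beta.T, alpha_perp.T])
Q_inv = np.hstack([alpha @ np.linalg.inv(beta.T @ alpha),
                   beta_perp @ np.linalg.inv(alpha_perp.T @ beta_perp)])

print(np.round(Q @ Q_inv, 10))          # identity matrix
print(np.round(Q @ Pi @ Q_inv, 10))     # beta' alpha in the upper left corner
```

Any other basis of the orthogonal complements would serve equally well, because the formula for $Q^{-1}$ does not depend on the normalization of $\alpha_\perp$ and $\beta_\perp$.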
Premultiplying the VECM (7.1.1) by $Q$ shows that

\[
\Delta z_t = Q\Pi Q^{-1} z_{t-1} + v_t
= \begin{bmatrix} \beta'\alpha & 0 \\ 0 & 0 \end{bmatrix} z_{t-1} + v_t.
\]

Hence, denoting the first $r$ components of $z_t$ by $z_t^{(1)}$, we know that $z_t^{(1)} = \beta' y_t$ consists of the cointegrating relations and is therefore stationary while the last $K - r$ components of $z_t$, denoted by $z_t^{(2)}$, constitute a $(K - r)$-dimensional random walk because $\Delta z_t^{(2)}$ is white noise. Thus, stationary and nonstationary components are separated in $z_t$. To derive the asymptotic properties of the LS estimator, it is useful to write

\[
Q(\hat\Pi - \Pi)Q^{-1}
= \left[ \sum_{t=1}^{T} v_t z_{t-1}^{(1)\prime} : \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \right]
\begin{bmatrix}
\sum_t z_{t-1}^{(1)} z_{t-1}^{(1)\prime} & \sum_t z_{t-1}^{(1)} z_{t-1}^{(2)\prime} \\
\sum_t z_{t-1}^{(2)} z_{t-1}^{(1)\prime} & \sum_t z_{t-1}^{(2)} z_{t-1}^{(2)\prime}
\end{bmatrix}^{-1}.
\tag{7.1.5}
\]

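For the cross product terms in this relation, we have the following special case results from Ahn & Reinsel (1990).

Lemma 7.1

(1) \[ T^{-1} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(1)\prime} = T^{-1} \sum_{t=1}^{T} \beta' y_{t-1} y_{t-1}' \beta \xrightarrow{p} \Gamma_z^{(1)}. \]

(2) \[ T^{-1/2}\, \mathrm{vec}\left( \sum_{t=1}^{T} v_t z_{t-1}^{(1)\prime} \right) \xrightarrow{d} \mathcal{N}\left(0, \Gamma_z^{(1)} \otimes \Sigma_v\right), \]

where $\Sigma_v := Q\Sigma_u Q'$ is the covariance matrix of $v_t$.

(3) \[ T^{-1} \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \xrightarrow{d} \Sigma_v^{1/2} \left( \int_0^1 W_K \, dW_K' \right)' \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix}, \]

where $W_K$ abbreviates a standard Wiener process $W_K(s)$ of dimension $K$ (see Appendix C.8.2).

(4) \[ T^{-3/2} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(2)\prime} \xrightarrow{p} 0. \]

(5) \[ T^{-2} \sum_{t=1}^{T} z_{t-1}^{(2)} z_{t-1}^{(2)\prime} \xrightarrow{d} [0 : I_{K-r}]\, \Sigma_v^{1/2} \left( \int_0^1 W_K W_K' \, ds \right) \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix}. \]

The quantities in (2), (3), and (5) converge jointly.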


In this lemma we encounter asymptotic distributions of random matrices. As in Appendix C.8.2, these are understood as the limits in distribution of the vectorized quantities. Because the asymptotic distributions are also conveniently stated in matrix form, not using vectorization here is a useful simplification. Moreover, in the lemma as well as in the following analysis we denote the square root of a positive definite matrix $\Sigma$ by $\Sigma^{1/2}$, that is, $\Sigma^{1/2}$ is the positive definite symmetric matrix for which $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$ (see Appendix A.9.2).

Proof: The proof follows Ahn & Reinsel (1990). Lemma 7.1(1) is implied by a standard weak law of large numbers (see, e.g., Proposition C.12(7)) because $z_t^{(1)}$ contains stationary components only.

The second result also involves stationary processes only. Therefore it follows from a martingale difference central limit theorem for stationary processes. Notice that $\mathrm{vec}(v_t z_{t-1}^{(1)\prime})$ is a martingale difference sequence and, hence, a martingale difference array which satisfies the conditions of Proposition C.13(2). Thus, the result follows from that proposition.

To show Lemma 7.1(3), we define a random walk

\[
z_t^* = \begin{bmatrix} z_t^{*(1)} \\ z_t^{*(2)} \end{bmatrix} = z_{t-1}^* + v_t, \quad t = 1, 2, \dots,
\]

with $z_0^* = 0$ and notice that the second part of $z_t^*$ is identical to the last $K - r$ components of $z_t$. Hence, it follows from Proposition C.18(6) that

\[
T^{-1} \sum_{t=1}^{T} v_t z_{t-1}^{*\prime} \xrightarrow{d} \Sigma_v^{1/2} \left( \int_0^1 W_K \, dW_K' \right)' \Sigma_v^{1/2}.
\]

Considering the last $K - r$ columns only gives the desired result.
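Part (4) of the lemma can be shown by defining

\[
z_t^+ = \begin{bmatrix} z_t^{+(1)} \\ z_t^{+(2)} \end{bmatrix} = z_{t-1}^+ + v_t^+, \quad t = 1, 2, \dots,
\]

with $z_0^+ = 0$ and

\[
v_t^+ = \begin{bmatrix} z_t^{(1)} \\ v_t^{(2)} \end{bmatrix}.
\]

Thus, $v_t^+$ is an $I(0)$ process. By Proposition C.18(5), we have

\[
\sum_{t=1}^{T} z_{t-1}^+ v_t^{+\prime}
= \begin{bmatrix}
\sum_t z_{t-1}^{+(1)} z_t^{(1)\prime} & \sum_t z_{t-1}^{+(1)} v_t^{(2)\prime} \\
\sum_t z_{t-1}^{(2)} z_t^{(1)\prime} & \sum_t z_{t-1}^{(2)} v_t^{(2)\prime}
\end{bmatrix}
= O_p(T),
\]

which implies the desired result.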

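Lemma 7.1(5) is just a special case of Proposition C.18(9) because $z_t^*$ is a random walk and the last $K - r$ components of $z_t^*$ are just $z_t^{(2)}$.

Finally, the joint convergence of the quantities in Lemma 7.1(2), (3), and (5) follows because all quantities are eventually made up of the same $u_t$'s. ■

The lemma implies the following limiting result for the LS estimator $\hat\Pi$.

Result 1

Let

\[
D = \begin{bmatrix} T^{1/2} I_r & 0 \\ 0 & T I_{K-r} \end{bmatrix}.
\]

Then

\[
\mathrm{vec}\left[ Q(\hat\Pi - \Pi)Q^{-1} D \right] \xrightarrow{d}
\begin{bmatrix}
\mathcal{N}\left(0, (\Gamma_z^{(1)})^{-1} \otimes \Sigma_v\right) \\[4pt]
\mathrm{vec}\left\{ \Sigma_v^{1/2} \left( \int_0^1 W_K \, dW_K' \right)' \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix}
\left( [0 : I_{K-r}]\, \Sigma_v^{1/2} \left( \int_0^1 W_K W_K' \, ds \right) \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \right)^{-1} \right\}
\end{bmatrix}.
\tag{7.1.6}
\]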

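Proof:

\[
\begin{aligned}
Q(\hat\Pi - \Pi)Q^{-1} D
&= \left[ \sum_{t=1}^{T} v_t z_{t-1}^{(1)\prime} : \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \right] D^{-1}
\left( D^{-1}
\begin{bmatrix}
\sum_t z_{t-1}^{(1)} z_{t-1}^{(1)\prime} & \sum_t z_{t-1}^{(1)} z_{t-1}^{(2)\prime} \\
\sum_t z_{t-1}^{(2)} z_{t-1}^{(1)\prime} & \sum_t z_{t-1}^{(2)} z_{t-1}^{(2)\prime}
\end{bmatrix}
D^{-1} \right)^{-1} \\
&= \left[ T^{-1/2} \sum_{t=1}^{T} v_t z_{t-1}^{(1)\prime} \left( T^{-1} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(1)\prime} \right)^{-1} :
T^{-1} \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \left( T^{-2} \sum_{t=1}^{T} z_{t-1}^{(2)} z_{t-1}^{(2)\prime} \right)^{-1} \right] + o_p(1).
\end{aligned}
\]

The last equality follows from Lemma 7.1(4). The result in (7.1.6) is obtained by vectorizing this matrix and applying Lemma 7.1(2), (3), and (5) and the continuous mapping theorem (see Appendix C.8). ■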

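An immediate implication of Result 1 follows.

Result 2

The estimator $\hat\Pi$ is asymptotically normal,

\[
\sqrt{T}\, \mathrm{vec}(\hat\Pi - \Pi) \xrightarrow{d} \mathcal{N}\left(0, \beta(\Gamma_z^{(1)})^{-1}\beta' \otimes \Sigma_u\right), \tag{7.1.7}
\]

and $\beta(\Gamma_z^{(1)})^{-1}\beta'$ can be estimated consistently by

\[
\left( T^{-1} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1}.
\]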
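Proof:

\[
\begin{aligned}
\sqrt{T}\, Q(\hat\Pi - \Pi)Q^{-1}
&= Q(\hat\Pi - \Pi)Q^{-1} D \begin{bmatrix} I_r & 0 \\ 0 & T^{-1/2} I_{K-r} \end{bmatrix} \\
&= \left[ T^{-1/2} \sum_{t=1}^{T} v_t z_{t-1}^{(1)\prime} \left( T^{-1} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(1)\prime} \right)^{-1} :
T^{-1/2} \left( T^{-1} \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \right) \left( T^{-2} \sum_{t=1}^{T} z_{t-1}^{(2)} z_{t-1}^{(2)\prime} \right)^{-1} \right] + o_p(1)
\end{aligned}
\]

from the proof of Result 1 and, hence,

\[
\sqrt{T}\, \mathrm{vec}\left[ Q(\hat\Pi - \Pi)Q^{-1} \right] = (Q^{-1\prime} \otimes Q)\, \sqrt{T}\, \mathrm{vec}(\hat\Pi - \Pi)
\xrightarrow{d} \mathcal{N}\left( 0, \begin{bmatrix} (\Gamma_z^{(1)})^{-1} & 0 \\ 0 & 0 \end{bmatrix} \otimes \Sigma_v \right).
\]

Premultiplying by $Q' \otimes Q^{-1}$ and recalling the definition of $Q$ gives a multivariate normal limiting distribution with covariance matrix

\[
(Q' \otimes Q^{-1}) \left( \begin{bmatrix} (\Gamma_z^{(1)})^{-1} & 0 \\ 0 & 0 \end{bmatrix} \otimes \Sigma_v \right) (Q \otimes Q^{-1\prime})
\]

or

\[
[\beta : \alpha_\perp] \begin{bmatrix} (\Gamma_z^{(1)})^{-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \beta' \\ \alpha_\perp' \end{bmatrix} \otimes Q^{-1} \Sigma_v Q^{-1\prime},
\]

which implies (7.1.7) because $\Sigma_v = Q\Sigma_u Q'$.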
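Now consider

\[
\begin{aligned}
\left( T^{-1} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1}
&= Q' \left( T^{-1} \sum_{t=1}^{T} z_{t-1} z_{t-1}' \right)^{-1} Q \\
&= Q' \begin{bmatrix}
T^{-1} \sum_t z_{t-1}^{(1)} z_{t-1}^{(1)\prime} & T^{-1} \sum_t z_{t-1}^{(1)} z_{t-1}^{(2)\prime} \\
T^{-1} \sum_t z_{t-1}^{(2)} z_{t-1}^{(1)\prime} & T^{-1} \sum_t z_{t-1}^{(2)} z_{t-1}^{(2)\prime}
\end{bmatrix}^{-1} Q \\
&= Q' \begin{bmatrix}
S_{11}^{-1} + S_{11}^{-1} S_{12} S^* S_{21} S_{11}^{-1} & -S_{11}^{-1} S_{12} S^* \\
-S^* S_{21} S_{11}^{-1} & S^*
\end{bmatrix} Q,
\end{aligned}
\]

where the rules for the partitioned inverse from Appendix A.10 have been used and $S^* := (S_{22} - S_{21} S_{11}^{-1} S_{12})^{-1}$. Moreover,

\[
S_{11} := T^{-1} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(1)\prime} \xrightarrow{p} \Gamma_z^{(1)}
\]

by Lemma 7.1(1),

\[
S_{12} = S_{21}' := T^{-1} \sum_{t=1}^{T} z_{t-1}^{(1)} z_{t-1}^{(2)\prime} = o_p(T^{1/2})
\]

by Lemma 7.1(4), and

\[
S_{22} := T^{-1} \sum_{t=1}^{T} z_{t-1}^{(2)} z_{t-1}^{(2)\prime}.
\]

By Lemma 7.1(5) and the continuous mapping theorem, $S_{22}^{-1} = O_p(T^{-1})$. Using again the rules for the partitioned inverse from Appendix A.10,

\[
\begin{aligned}
S^* &= S_{22}^{-1} + S_{22}^{-1} S_{21} (S_{11} - S_{12} S_{22}^{-1} S_{21})^{-1} S_{12} S_{22}^{-1} \\
&= O_p(T^{-1}) + O_p(T^{-1})\, o_p(T^{1/2})\, O_p(1)\, o_p(T^{1/2})\, O_p(T^{-1}) \\
&= O_p(T^{-1}),
\end{aligned}
\]

because

\[
S_{11} - S_{12} S_{22}^{-1} S_{21} = S_{11} - o_p(T^{1/2})\, O_p(T^{-1})\, o_p(T^{1/2}) = S_{11} + o_p(1)
\]

so that

\[
(S_{11} - S_{12} S_{22}^{-1} S_{21})^{-1} = O_p(1).
\]

Hence, we get

\[
\begin{aligned}
S_{11}^{-1} + S_{11}^{-1} S_{12} S^* S_{21} S_{11}^{-1}
&= (\Gamma_z^{(1)})^{-1} + O_p(1)\, o_p(T^{1/2})\, O_p(T^{-1})\, o_p(T^{1/2})\, O_p(1) + o_p(1) \\
&= (\Gamma_z^{(1)})^{-1} + o_p(1)
\end{aligned}
\]

and

\[
S_{11}^{-1} S_{12} S^* = O_p(1)\, o_p(T^{1/2})\, O_p(T^{-1}) = o_p(1).
\]

Thus,

\[
\left( T^{-1} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1} \xrightarrow{p}
Q' \begin{bmatrix} (\Gamma_z^{(1)})^{-1} & 0 \\ 0 & 0 \end{bmatrix} Q = \beta(\Gamma_z^{(1)})^{-1}\beta',
\]

which proves Result 2. ■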

Thus, the limiting distribution of $\sqrt{T}\,\mathrm{vec}(\hat\Pi - \Pi)$ is singular because $\Gamma_z^{(1)}$ is an $(r \times r)$ matrix. Still, we can use the usual estimator of the covariance matrix based on the regressor matrix. Thus, $t$-ratios can be set up in the standard way and have their usual asymptotic standard normal distributions, if a consistent estimator of $\Sigma_u$ is used. In Result 8, we will see that the usual residual covariance matrix is in fact a consistent estimator for $\Sigma_u$, as in the stationary case. On the other hand, it is not difficult to see that the covariance matrix in the limiting distribution (7.1.7) has rank $rK$. Therefore, setting up a Wald test for more general restrictions may be problematic. As explained in Appendix C.7, a nonsingular weighting matrix is needed for the Wald test to have its usual limiting $\chi^2$-distribution under the null hypothesis. Thus, if we want to test, for example,

\[
H_0 : \Pi = 0 \quad\text{versus}\quad H_1 : \Pi \neq 0,
\]

the corresponding Wald statistic is

\[
\lambda_W = T\, \mathrm{vec}(\hat\Pi)' \left( \left( T^{-1} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1} \otimes \hat\Sigma_u \right)^{-1} \mathrm{vec}(\hat\Pi).
\]

Under $H_0$, the arguments in the proof of Result 2 can be used to show that $\left( T^{-1} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1}$ converges to zero in probability and, hence, the limit of the weighting matrix in the Wald statistic is singular. Thus, $\lambda_W$ will not have an asymptotic $\chi^2(K^2)$-distribution. Therefore, caution is necessary in setting up $F$-tests, for example. In the nonstationary case, they may not have an asymptotic justification. We will provide more discussion of this problem in Section 7.6 in the context of testing for Granger-causality.
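
A small simulation may help to make Result 2 and the singularity of the limiting covariance matrix tangible. The following Python sketch generates data from the special case model (7.1.1) with hypothetical parameter values ($K = 2$, $r = 1$, $\Sigma_u = I_2$, $y_0 = 0$) and computes the LS estimator (7.1.2), the residual covariance matrix and the matrix $(T^{-1}\sum_t y_{t-1}y_{t-1}')^{-1}$. The sample size and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP: Delta y_t = alpha beta' y_{t-1} + u_t, u_t ~ N(0, I_2)
alpha = np.array([[-0.5], [0.2]])
beta = np.array([[1.0], [-1.0]])
Pi = alpha @ beta.T
T, K = 10_000, 2

u = rng.standard_normal((T, K))
y = np.zeros((T + 1, K))                 # y[0] is the presample vector y_0
for t in range(1, T + 1):
    y[t] = y[t - 1] + Pi @ y[t - 1] + u[t - 1]

dy = np.diff(y, axis=0)                  # rows are Delta y_t'
ylag = y[:-1]                            # rows are y_{t-1}'

# Unrestricted LS estimator (7.1.2)
Pi_hat = (dy.T @ ylag) @ np.linalg.inv(ylag.T @ ylag)

# Usual residual covariance estimator
resid = dy - ylag @ Pi_hat.T
Sigma_hat = resid.T @ resid / T

# Estimator of beta (Gamma_z^(1))^{-1} beta', which has rank r = 1 in the limit
S = np.linalg.inv(ylag.T @ ylag / T)
eig = np.linalg.eigvalsh(S)              # ascending order

print(np.round(Pi_hat, 3))
print(eig)
```

In a run of this kind one would expect one eigenvalue of `S` to be close to zero relative to the other, in line with the rank $r$ limit $\beta(\Gamma_z^{(1)})^{-1}\beta'$. This near singularity is what makes the weighting matrix of a Wald test problematic.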

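It is interesting to note that the asymptotic distribution in (7.1.7) is the same one that is obtained if the cointegration matrix $\beta$ is known and only $\alpha$ is estimated by LS. To see this result, we consider the LS estimator

\[
\hat\alpha = \left( \sum_{t=1}^{T} \Delta y_t y_{t-1}' \beta \right) \left( \sum_{t=1}^{T} \beta' y_{t-1} y_{t-1}' \beta \right)^{-1}. \tag{7.1.8}
\]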

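This estimator has the following properties.

Result 3

\[
\sqrt{T}\, \mathrm{vec}(\hat\alpha - \alpha) \xrightarrow{d} \mathcal{N}\left(0, (\Gamma_z^{(1)})^{-1} \otimes \Sigma_u\right) \tag{7.1.9}
\]

and, thus,

\[
\sqrt{T}\, \mathrm{vec}(\hat\alpha\beta' - \Pi) \xrightarrow{d} \mathcal{N}\left(0, \beta(\Gamma_z^{(1)})^{-1}\beta' \otimes \Sigma_u\right).
\]

Proof: Substituting $\alpha\beta' y_{t-1} + u_t$ for $\Delta y_t$ in (7.1.8) and rearranging terms gives

\[
\hat\alpha - \alpha = \left( \sum_{t=1}^{T} u_t y_{t-1}' \beta \right) \left( \sum_{t=1}^{T} \beta' y_{t-1} y_{t-1}' \beta \right)^{-1}
\]

from which we get (7.1.9) by similar arguments as in the proof of Lemma 7.1. Noting that $\mathrm{vec}(\hat\alpha\beta' - \Pi) = (\beta \otimes I_K)\,\mathrm{vec}(\hat\alpha - \alpha)$ gives the stated asymptotic distribution of $\sqrt{T}\,\mathrm{vec}(\hat\alpha\beta' - \Pi)$. ■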

Clearly, this result may seem a bit surprising because it means that knowledge of $\beta$ does not improve our estimator for $\Pi$, at least asymptotically. In turn, not knowing $\beta$ does not lead to a reduction in asymptotic precision of our estimator. This is a consequence of the fact that $\beta$ can be estimated with a better convergence rate than $\sqrt{T}$. To see this fact, suppose for the moment that $\alpha$ is known and that $\beta$ is normalized as in (6.3.9) such that

\[
\beta = \begin{bmatrix} I_r \\ \beta_{(K-r)} \end{bmatrix}. \tag{7.1.10}
\]

We know from the discussion in Section 6.3 that this normalization is always possible if the variables are arranged appropriately. Thus, upon normalization, the only unknown elements of $\beta$ are in the $((K - r) \times r)$ matrix $\beta_{(K-r)}$. This matrix can be estimated from

\[
\Delta y_t - \alpha y_{t-1}^{(1)} = \alpha\beta_{(K-r)}' y_{t-1}^{(2)} + u_t = (y_{t-1}^{(2)\prime} \otimes \alpha)\, \mathrm{vec}(\beta_{(K-r)}') + u_t, \tag{7.1.11}
\]

where $y_{t-1}^{(1)}$ and $y_{t-1}^{(2)}$ consist of the first $r$ and the last $K - r$ elements of $y_{t-1}$, respectively. Because this is a multivariate regression model where the regressors are not identical in the different equations, we assume for the moment that $\Sigma_u$ is also known and consider the GLS estimator

\[
\mathrm{vec}(\hat\beta_{(K-r)}') = \left[ \left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1} \otimes (\alpha'\Sigma_u^{-1}\alpha)^{-1} \right]
(I_{K-r} \otimes \alpha'\Sigma_u^{-1})\, \mathrm{vec}\left( \sum_{t=1}^{T} (\Delta y_t - \alpha y_{t-1}^{(1)})\, y_{t-1}^{(2)\prime} \right)
\]

or

\[
\hat\beta_{(K-r)}' = (\alpha'\Sigma_u^{-1}\alpha)^{-1} \alpha'\Sigma_u^{-1}
\left( \sum_{t=1}^{T} (\Delta y_t - \alpha y_{t-1}^{(1)})\, y_{t-1}^{(2)\prime} \right)
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}. \tag{7.1.12}
\]

This estimator has the following asymptotic distribution.

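Result 4

\[
T(\hat\beta_{(K-r)}' - \beta_{(K-r)}') \xrightarrow{d}
\left( \int_0^1 W_{K-r}^* \, dW_r^{*\prime} \right)' \left( \int_0^1 W_{K-r}^* W_{K-r}^{*\prime} \, ds \right)^{-1}, \tag{7.1.13}
\]

where

\[
W_{K-r}^* := Q^{22} [0 : I_{K-r}]\, \Sigma_v^{1/2} W_K,
\]

$Q^{22}$ denotes the lower right-hand $((K - r) \times (K - r))$ block of $Q^{-1}$ and

\[
W_r^* := (\alpha'\Sigma_u^{-1}\alpha)^{-1} \alpha'\Sigma_u^{-1} Q^{-1} \Sigma_v^{1/2} W_K.
\]

Thus, the asymptotic distribution depends on functionals of a standard Wiener process. ■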

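Proof: Replacing $\Delta y_t - \alpha y_{t-1}^{(1)}$ in (7.1.12) with $\alpha\beta_{(K-r)}' y_{t-1}^{(2)} + u_t$ and rearranging terms gives

\[
\hat\beta_{(K-r)}' - \beta_{(K-r)}' = (\alpha'\Sigma_u^{-1}\alpha)^{-1} \alpha'\Sigma_u^{-1}
\left( \sum_{t=1}^{T} u_t y_{t-1}^{(2)\prime} \right)
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}. \tag{7.1.14}
\]

Thus, we have to consider the quantity

\[
\left( T^{-1} \sum_{t=1}^{T} u_t y_{t-1}^{(2)\prime} \right) \left( T^{-2} \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}.
\]

For the first matrix on the right-hand side we have

\[
\begin{aligned}
T^{-1} \sum_{t=1}^{T} u_t y_{t-1}^{(2)\prime}
&= T^{-1} \sum_{t=1}^{T} u_t y_{t-1}' \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&= Q^{-1} \left( T^{-1} \sum_{t=1}^{T} v_t z_{t-1}' \right) Q^{-1\prime} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&= Q^{-1} \left[ o_p(1) : T^{-1} \sum_{t=1}^{T} v_t z_{t-1}^{(2)\prime} \right] Q^{-1\prime} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&\xrightarrow{d} Q^{-1} \Sigma_v^{1/2} \left( \int_0^1 W_K \, dW_K' \right)' \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} Q^{22\prime},
\end{aligned}
\]

where Lemma 7.1(2) and (3) have been used for the last equality and the limiting result, respectively. Thus,

\[
(\alpha'\Sigma_u^{-1}\alpha)^{-1} \alpha'\Sigma_u^{-1}\, T^{-1} \sum_{t=1}^{T} u_t y_{t-1}^{(2)\prime}
\xrightarrow{d} \left( \int_0^1 W_{K-r}^* \, dW_r^{*\prime} \right)'. \tag{7.1.15}
\]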

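The matrix

\[
\begin{aligned}
T^{-2} \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime}
&= [0 : I_{K-r}] \left( T^{-2} \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right) \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&= [0 : I_{K-r}]\, Q^{-1} \left( T^{-2} \sum_{t=1}^{T} z_{t-1} z_{t-1}' \right) Q^{-1\prime} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&= [0 : I_{K-r}]\, Q^{-1} \begin{bmatrix} o_p(1) & o_p(1) \\ o_p(1) & T^{-2} \sum_t z_{t-1}^{(2)} z_{t-1}^{(2)\prime} \end{bmatrix} Q^{-1\prime} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} \\
&= Q^{22} \left( T^{-2} \sum_{t=1}^{T} z_{t-1}^{(2)} z_{t-1}^{(2)\prime} \right) Q^{22\prime} + o_p(1) \\
&\xrightarrow{d} Q^{22} [0 : I_{K-r}]\, \Sigma_v^{1/2} \left( \int_0^1 W_K W_K' \, ds \right) \Sigma_v^{1/2} \begin{bmatrix} 0 \\ I_{K-r} \end{bmatrix} Q^{22\prime} \\
&= \int_0^1 W_{K-r}^* W_{K-r}^{*\prime} \, ds,
\end{aligned}
\tag{7.1.16}
\]

where Lemma 7.1(5) has been applied. Using (7.1.14) and combining (7.1.15) and (7.1.16) gives the result in (7.1.13). ■

Clearly, in the present model setup, the GLS estimator of $\beta_{(K-r)}'$ does not have the usual normal limiting distribution. In fact, it converges with rate $T$ rather than the usual rate $\sqrt{T}$, at least under our present rather restrictive assumptions. The asymptotic distribution consists of functionals of a standard Wiener process. It is also interesting to note that the two Wiener processes $W_r^*$ and $W_{K-r}^*$ are independent because their cross-covariance matrix is

\[
\begin{aligned}
Q^{22} [0 : I_{K-r}]\, \Sigma_v Q^{-1\prime} \Sigma_u^{-1} \alpha (\alpha'\Sigma_u^{-1}\alpha)^{-1}
&= Q^{22} [0 : I_{K-r}]\, Q \alpha (\alpha'\Sigma_u^{-1}\alpha)^{-1} \\
&= Q^{22} \alpha_\perp' \alpha (\alpha'\Sigma_u^{-1}\alpha)^{-1} \\
&= 0,
\end{aligned}
\]

where $\Sigma_v = Q\Sigma_u Q'$ has been used to obtain the first equality. The independence of the two Wiener processes implies that the conditional distribution of

\[
\mathrm{vec}\left[ \left( \int_0^1 W_{K-r}^* \, dW_r^{*\prime} \right)' \right]
\]

given $W_{K-r}^*$ is

\[
\mathcal{N}\left( 0, \int_0^1 W_{K-r}^* W_{K-r}^{*\prime} \, ds \otimes (\alpha'\Sigma_u^{-1}\alpha)^{-1} \right)
\]

(see Ahn & Reinsel (1990), Phillips & Park (1988) or Johansen (1995)). This reasoning leads to the following interesting result.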

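Result 5

\[
\mathrm{vec}\left[ (\hat\beta_{(K-r)}' - \beta_{(K-r)}') \left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{1/2} \right]
\xrightarrow{d} \mathcal{N}\left( 0, I_{K-r} \otimes (\alpha'\Sigma_u^{-1}\alpha)^{-1} \right). \tag{7.1.17}
\]

Proof: From (7.1.16) we have

\[
T^{-2} \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \xrightarrow{d} \int_0^1 W_{K-r}^* W_{K-r}^{*\prime} \, ds.
\]

Hence, Result 5 follows because

\[
\mathrm{vec}\left[ (\hat\beta_{(K-r)}' - \beta_{(K-r)}') \left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{1/2} \right]
= \left( \left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{1/2} \otimes I_r \right) \mathrm{vec}(\hat\beta_{(K-r)}' - \beta_{(K-r)}').
\]

Result 5 means that, although the GLS estimator $\hat\beta_{(K-r)}'$ has a nonstandard limiting distribution, a transformation is asymptotically normal and can, for example, be used to construct hypothesis tests with standard limiting distributions. For example, $t$-ratios can be constructed in the usual way by considering an element of $\hat\beta_{(K-r)}'$ and dividing by its asymptotic standard deviation obtained from

\[
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1} \otimes (\alpha'\Sigma_u^{-1}\alpha)^{-1}.
\]

Also Wald tests can be constructed as usual (see Appendix C.7).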
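
The GLS estimator (7.1.12) and the $t$-ratio suggested by Result 5 can be illustrated by simulation. The Python sketch below treats $\alpha$ and $\Sigma_u$ as known, as in the text, and uses hypothetical parameter values ($K = 2$, $r = 1$, $\beta_{(K-r)} = -1$, $\Sigma_u = I_2$). All variable names, the sample size and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical DGP with beta normalized as in (7.1.10): beta = (1, beta_(K-r))'
alpha = np.array([[-0.5], [0.2]])
beta_true = np.array([[1.0], [-1.0]])
Sigma_u = np.eye(2)
T = 5_000

u = rng.standard_normal((T, 2))
y = np.zeros((T + 1, 2))
for t in range(1, T + 1):
    y[t] = y[t - 1] + (alpha @ beta_true.T) @ y[t - 1] + u[t - 1]

dy = np.diff(y, axis=0)
y1 = y[:-1, :1]        # y_{t-1}^{(1)}: first r = 1 component
y2 = y[:-1, 1:]        # y_{t-1}^{(2)}: last K - r = 1 component

# GLS estimator (7.1.12), alpha and Sigma_u treated as known
Sinv = np.linalg.inv(Sigma_u)
aSa_inv = np.linalg.inv(alpha.T @ Sinv @ alpha)
lhs = dy - y1 @ alpha.T                       # rows: (Delta y_t - alpha y_{t-1}^{(1)})'
S_y2 = y2.T @ y2
beta_hat = aSa_inv @ alpha.T @ Sinv @ (lhs.T @ y2) @ np.linalg.inv(S_y2)

# t-ratio for H0: beta_(K-r) = -1, standard deviation taken from
# (sum y2 y2')^{-1} (x) (alpha' Sigma_u^{-1} alpha)^{-1}
se = np.sqrt(np.linalg.inv(S_y2)[0, 0] * aSa_inv[0, 0])
t_ratio = (beta_hat[0, 0] - beta_true[1, 0]) / se

print(beta_hat[0, 0], se, t_ratio)
```

Repeating the experiment for increasing $T$ should show the estimation error shrinking roughly in proportion to $1/T$ rather than $1/\sqrt{T}$, which is the rate statement of Result 4.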

Of course, the GLS estimator is only available under the very restrictive assumption that both $\alpha$ and $\Sigma_u$ are known. It turns out, however, that the same asymptotic distribution is obtained for the corresponding EGLS estimator,

\[
\hat{\hat\beta}{}_{(K-r)}' = (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1} \hat\alpha'\hat\Sigma_u^{-1}
\left( \sum_{t=1}^{T} (\Delta y_t - \hat\alpha y_{t-1}^{(1)})\, y_{t-1}^{(2)\prime} \right)
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}, \tag{7.1.18}
\]

where $\hat\alpha$ and $\hat\Sigma_u$ are consistent estimators of $\alpha$ and $\Sigma_u$, respectively. Fortunately, such estimators are available in the present case. A consistent estimator $\hat\alpha$ follows from Result 2. If $\beta$ is normalized as in (7.1.10), the first $r$ columns of $\Pi$ are equal to $\alpha$. Hence, the first $r$ columns of $\hat\Pi$ are a consistent estimator of $\alpha$ and the usual white noise covariance matrix estimator from the unrestricted LS estimation can be shown to be a consistent estimator of $\Sigma_u$, as we will demonstrate later (see Result 8). The following result can be established.

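Result 6

\[
T(\hat{\hat\beta}{}_{(K-r)}' - \hat\beta_{(K-r)}') = o_p(1). \tag{7.1.19}
\]

Proof: Defining $u_t^* = \Delta y_t - \hat\alpha\beta' y_{t-1}$ and substituting $\hat\alpha\beta_{(K-r)}' y_{t-1}^{(2)} + u_t^*$ for $\Delta y_t - \hat\alpha y_{t-1}^{(1)}$ in (7.1.18) gives, after rearrangement of terms,

\[
\hat{\hat\beta}{}_{(K-r)}' - \beta_{(K-r)}' = (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1} \hat\alpha'\hat\Sigma_u^{-1}
\left( \sum_{t=1}^{T} u_t^* y_{t-1}^{(2)\prime} \right)
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}.
\]

Hence,

\[
\begin{aligned}
T(\hat{\hat\beta}{}_{(K-r)}' - \hat\beta_{(K-r)}')
={}& \left[ (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1} \hat\alpha'\hat\Sigma_u^{-1} - (\alpha'\Sigma_u^{-1}\alpha)^{-1} \alpha'\Sigma_u^{-1} \right] \\
&\times \left( T^{-1} \sum_{t=1}^{T} u_t y_{t-1}^{(2)\prime} \right) \left( T^{-2} \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1} \\
&+ (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1} \hat\alpha'\hat\Sigma_u^{-1} \\
&\times \left( T^{-1} \sum_{t=1}^{T} (u_t^* - u_t)\, y_{t-1}^{(2)\prime} \right) \left( T^{-2} \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1}.
\end{aligned}
\]

The term in brackets is $o_p(1)$ because $\hat\alpha$ and $\hat\Sigma_u$ are consistent estimators by assumption. Moreover, $T^{-1} \sum_{t=1}^{T} (u_t^* - u_t)\, y_{t-1}^{(2)\prime} = o_p(1)$ (see Problem 7.1). Thus, the desired result follows because all other terms converge as established previously. ■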

If the process is assumed to be Gaussian, ML estimation may be used alternatively. In case $\alpha$ and $\Sigma_u$ are known, the ML estimator is identical to the GLS estimator for $\beta_{(K-r)}'$ and, hence, $\hat\beta_{(K-r)}'$ is also the ML estimator. If $\alpha$ and $\Sigma_u$ are unknown, ML estimation under the constraint $\mathrm{rk}(\Pi) = r$ may be used. The log-likelihood function is

\[
\ln l = -\frac{KT}{2} \ln 2\pi - \frac{T}{2} \ln |\Sigma_u| - \frac{1}{2} \sum_{t=1}^{T} (\Delta y_t - \Pi y_{t-1})' \Sigma_u^{-1} (\Delta y_t - \Pi y_{t-1}). \tag{7.1.20}
\]

From Chapter 3, we know that maximizing this function is equivalent to minimizing the determinant

\[
\left| T^{-1} \sum_{t=1}^{T} (\Delta y_t - \Pi y_{t-1})(\Delta y_t - \Pi y_{t-1})' \right|.
\]

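To impose the rank restriction $\mathrm{rk}(\Pi) = r$, we write $\Pi = \alpha\beta'$ where $\alpha$ and $\beta$ are $(K \times r)$ matrices with rank $r$. For the moment we do not impose any normalization restrictions and consider minimization of the determinant

\[
\left| T^{-1} \sum_{t=1}^{T} (\Delta y_t - \alpha\beta' y_{t-1})(\Delta y_t - \alpha\beta' y_{t-1})' \right|
\]

with respect to $\alpha$ and $\beta$. This minimization problem is solved in Proposition A.7 in Appendix A.14 and the solution is obtained by considering the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_K$ and the associated orthonormal eigenvectors $v_1, \dots, v_K$ of the matrix

\[
\left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1/2}
\left( \sum_{t=1}^{T} y_{t-1} \Delta y_t' \right)
\left( \sum_{t=1}^{T} \Delta y_t \Delta y_t' \right)^{-1}
\left( \sum_{t=1}^{T} \Delta y_t y_{t-1}' \right)
\left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1/2}.
\]

The minimum of the determinant is attained for

\[
\tilde\beta' = [v_1, \dots, v_r]' \left( \sum_{t=1}^{T} y_{t-1} y_{t-1}' \right)^{-1/2} \tag{7.1.21}
\]

and

\[
\tilde\alpha = \left( \sum_{t=1}^{T} \Delta y_t y_{t-1}' \tilde\beta \right) \left( \sum_{t=1}^{T} \tilde\beta' y_{t-1} y_{t-1}' \tilde\beta \right)^{-1}. \tag{7.1.22}
\]

Clearly, the resulting ML estimator $\tilde\Pi = \tilde\alpha\tilde\beta'$ for $\Pi$ must have the same asymptotic properties as the unrestricted LS estimator of $\Pi$ because even the estimator in Result 3, which is based on a known $\beta$, does not have better properties. Notice that, for a Gaussian model, the LS estimator based on a known $\beta$ is equal to the ML estimator because the same regressors appear in all equations. Thus, we can draw the following conclusion.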
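
The eigenvalue solution of the reduced rank problem can be sketched in Python as follows. The code is a minimal illustration under stated assumptions: the data are simulated from a hypothetical bivariate process with $r = 1$ and $\Sigma_u = I_2$, the symmetric inverse square root is computed from an eigendecomposition, and the function names `sym_inv_sqrt` and `ml_vecm` are invented for this example.

```python
import numpy as np

def sym_inv_sqrt(S):
    """Inverse of the symmetric positive definite square root of S."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def ml_vecm(y, r):
    """Reduced rank (Gaussian ML) estimators of alpha and beta in
    Delta y_t = alpha beta' y_{t-1} + u_t, without normalizing beta."""
    dy = np.diff(y, axis=0).T                  # (K x T), columns Delta y_t
    ylag = y[:-1].T                            # (K x T), columns y_{t-1}
    S11 = ylag @ ylag.T                        # sum y_{t-1} y_{t-1}'
    S01 = dy @ ylag.T                          # sum Delta y_t y_{t-1}'
    S00 = dy @ dy.T                            # sum Delta y_t Delta y_t'
    S11_is = sym_inv_sqrt(S11)
    M = S11_is @ S01.T @ np.linalg.inv(S00) @ S01 @ S11_is
    M = (M + M.T) / 2                          # remove rounding asymmetry
    lam, V = np.linalg.eigh(M)                 # eigenvalues in ascending order
    Vr = V[:, ::-1][:, :r]                     # eigenvectors of the r largest
    beta = S11_is @ Vr
    alpha = S01 @ beta @ np.linalg.inv(beta.T @ S11 @ beta)
    return alpha, beta

# Hypothetical DGP
rng = np.random.default_rng(2)
alpha0 = np.array([[-0.5], [0.2]])
beta0 = np.array([[1.0], [-1.0]])
Pi0 = alpha0 @ beta0.T
T = 5_000
y = np.zeros((T + 1, 2))
for t in range(1, T + 1):
    y[t] = y[t - 1] + Pi0 @ y[t - 1] + rng.standard_normal(2)

alpha_ml, beta_ml = ml_vecm(y, r=1)
Pi_ml = alpha_ml @ beta_ml.T
beta_norm = beta_ml / beta_ml[0, 0]            # normalization as in (7.1.10)
print(np.round(Pi_ml, 3), np.round(beta_norm.ravel(), 3))
```

The product `Pi_ml` does not depend on how $\tilde\beta$ is scaled, whereas the individual factors do. This is why a normalization such as (7.1.10) is needed before the elements of the cointegration matrix can be interpreted.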

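Result 7

\[
\sqrt{T}\, \mathrm{vec}(\tilde\alpha\tilde\beta' - \Pi) \xrightarrow{d} \mathcal{N}\left(0, \beta(\Gamma_z^{(1)})^{-1}\beta' \otimes \Sigma_u\right). \tag{7.1.23}
\]

This result was derived by Johansen (1995) and other authors for more general models. It is also interesting to note that we can, of course, normalize the ML estimator for $\beta$ as in (7.1.10), that is, we postmultiply the estimator in (7.1.21) by the inverse of the upper $(r \times r)$ submatrix. Denoting the normalized estimator by $\breve\beta$ and using the corresponding estimator for $\alpha$ from (7.1.22),

\[
\breve\alpha = \left( \sum_{t=1}^{T} \Delta y_t y_{t-1}' \breve\beta \right) \left( \sum_{t=1}^{T} \breve\beta' y_{t-1} y_{t-1}' \breve\beta \right)^{-1},
\]

gives an estimator $\breve\alpha\breve\beta'$ of $\Pi$ which is identical to $\tilde\alpha\tilde\beta'$. Thus, the asymptotic properties must also be identical. It follows that $\breve\alpha$ has the same asymptotic distribution as the LS estimator in (7.1.9). Moreover, the asymptotic distribution of the lower $((K - r) \times r)$ part of $\breve\beta$ is the same as that of the GLS estimator in Result 4 because

\[
\breve\beta_{(K-r)}' = (\breve\alpha'\tilde\Sigma_u^{-1}\breve\alpha)^{-1} \breve\alpha'\tilde\Sigma_u^{-1}
\left( \sum_{t=1}^{T} (\Delta y_t - \breve\alpha y_{t-1}^{(1)})\, y_{t-1}^{(2)\prime} \right)
\left( \sum_{t=1}^{T} y_{t-1}^{(2)} y_{t-1}^{(2)\prime} \right)^{-1},
\]

where the ML estimator $\tilde\Sigma_u$ is substituted for $\Sigma_u$. Thus, the asymptotic distribution of $\breve\beta_{(K-r)}'$ follows from Result 6 and the consistency of the ML estimators $\breve\alpha$ and $\tilde\Sigma_u$.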

In fact, any of the estimators for $\Pi$ which we have considered so far leads to a consistent estimator of the white noise covariance matrix of the form

$$\tilde\Sigma_u = T^{-1}\sum_{t=1}^T(\Delta y_t - \tilde\Pi y_{t-1})(\Delta y_t - \tilde\Pi y_{t-1})'. \qquad (7.1.24)$$

Here $\tilde\Pi$ can be any of the estimators for $\Pi$ considered so far, because they are all asymptotically equivalent. The following result can be established.

Result 8

$$\mathrm{plim}\ \tilde\Sigma_u = \Sigma_u. \qquad (7.1.25)$$


Proof: Notice that

$$\begin{aligned}
\tilde\Sigma_u &= T^{-1}\sum_{t=1}^T(\Pi y_{t-1} - \tilde\Pi y_{t-1} + u_t)(\Pi y_{t-1} - \tilde\Pi y_{t-1} + u_t)' \\
&= T^{-1}\sum_{t=1}^T u_tu_t' + (\Pi - \tilde\Pi)\Big(T^{-1}\sum_{t=1}^T y_{t-1}y_{t-1}'\Big)(\Pi - \tilde\Pi)' \\
&\quad + (\Pi - \tilde\Pi)\Big(T^{-1}\sum_{t=1}^T y_{t-1}u_t'\Big) + \Big(T^{-1}\sum_{t=1}^T u_ty_{t-1}'\Big)(\Pi - \tilde\Pi)'.
\end{aligned} \qquad (7.1.26)$$

Using a standard law of large numbers,

$$\mathrm{plim}\ T^{-1}\sum_{t=1}^T u_tu_t' = \Sigma_u.$$

Thus, it suffices to show that all other terms are $o_p(1)$. This property follows because from Lemma 7.1 we have

$$T^{-1}\sum_{t=1}^T \beta'y_{t-1}u_t' = o_p(1)$$

and

$$T^{-1}\sum_{t=1}^T \beta'y_{t-1}y_{t-1}'\beta = O_p(1).$$

Using the estimator $\hat\alpha\beta'$ for $\Pi$, it is easily seen that all terms but the first on the right-hand side of the last equality sign in (7.1.26) converge to zero in probability. The argument is easily extended to the other estimators by noting that their difference to the previously treated estimator is $o_p(T^{-1/2})$.


So far we have assumed that $r \neq 0$ and, hence, $\Pi \neq 0$. This assumption is of obvious importance for some of the results to hold and some of the proofs to work. If $\Pi = 0$, the analysis becomes even simpler in some respects. In that case, $y_t$ is a multivariate random walk and we can apply Proposition C.18 directly to evaluate the asymptotic properties of the term

$$T(\hat\Pi - \Pi) = \Big(T^{-1}\sum_{t=1}^T u_ty_{t-1}'\Big)\Big(T^{-2}\sum_{t=1}^T y_{t-1}y_{t-1}'\Big)^{-1},$$

where $\hat\Pi$ is again the LS estimator. Using Proposition C.18(6) and (9) gives the following result.

Result 9

If the cointegrating rank $r = 0$,

$$T(\hat\Pi - \Pi) \xrightarrow{d} \Sigma_u^{1/2}\Big(\int_0^1 W_K\,dW_K'\Big)'\Big(\int_0^1 W_KW_K'\,ds\Big)^{-1}\Sigma_u^{-1/2}. \qquad (7.1.27)$$


The LS estimator is again identical to the ML estimator and, hence, the same result is obtained for the latter. On the other hand, the GLS estimator is not applicable here. Now we cannot even use the usual t-ratios anymore in a standard way because they do not have a limiting standard normal distribution in this case. For the special case of a univariate model this can be seen from Appendix C.8.1. Notice that for $K = 1$, $\Pi = \rho - 1$ in Proposition C.17 and, thus, the asymptotic distribution of $T\hat\Pi = T(\hat\rho - 1)$ is clearly different from the standard normal in this case.
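The nonstandard limit can be made visible with a small simulation. The following Python sketch (NumPy assumed; the function name, sample size, number of replications, and seed are illustrative choices, not taken from the text) draws univariate random walks, computes the LS estimator of $\Pi$ from a regression of $\Delta y_t$ on $y_{t-1}$, and collects $T\hat\Pi$. Under a normal limit the simulated values would be roughly symmetric around zero; here they are expected to be skewed to the left.

```python
import numpy as np

def simulate_T_pi_hat(T=200, reps=2000, seed=0):
    """Simulate T * Pi_hat for a univariate random walk (K = 1, Pi = 0).

    Pi_hat is the LS estimator from regressing Delta y_t on y_{t-1}
    without an intercept, so that Pi_hat = rho_hat - 1.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((reps, T))                    # u_t, one row per replication
    y = np.cumsum(u, axis=1)                              # y_1, ..., y_T with y_0 = 0
    y_lag = np.hstack([np.zeros((reps, 1)), y[:, :-1]])   # y_0, ..., y_{T-1}
    pi_hat = (u * y_lag).sum(axis=1) / (y_lag ** 2).sum(axis=1)
    return T * pi_hat

stats = simulate_T_pi_hat()
print("mean:", stats.mean(), "median:", np.median(stats))
```

A histogram of `stats` should show a long left tail, which is the kind of asymmetry a standard normal approximation cannot reproduce.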

The results for the estimator of the VECM imply analogous results for the parameters of the corresponding levels VAR form $y_t = A_1y_{t-1} + u_t$. Notice that $A_1 = \Pi + I_K$. Consequently, we have for the LS estimator, for example,

$$\hat A_1 - A_1 = \hat\Pi - \Pi. \qquad (7.1.28)$$

Hence, the asymptotic properties of $\hat A_1$ follow immediately from those of $\hat\Pi$.

The simple model we have discussed in this section shows the main differences to the stationary case. All the results can be extended to richer models with short-term dynamics and deterministic terms. Estimation of such models will be considered in the next section.

7.2 Estimation of General VECMs


We first consider a model without deterministic terms,

$$\Delta y_t = \Pi y_{t-1} + \Gamma_1\Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + u_t, \qquad (7.2.1)$$

where $y_t$ is a process of dimension $K$, $\mathrm{rk}(\Pi) = r$ with $0 < r < K$ so that $\Pi = \alpha\beta'$, where $\alpha$ and $\beta$ are $(K \times r)$ matrices with $\mathrm{rk}(\alpha) = \mathrm{rk}(\beta) = r$. All other symbols have their conventional meanings, that is, the $\Gamma_j$ ($j = 1,\dots,p-1$) are $(K \times K)$ parameter matrices and $u_t \sim (0, \Sigma_u)$ is standard white noise. Also, $y_t$ is assumed to be an $I(1)$ process so that

$$\alpha_\perp'\Big(I_K - \sum_{i=1}^{p-1}\Gamma_i\Big)\beta_\perp \qquad (7.2.2)$$

is nonsingular (see Section 6.3, Eq. (6.3.12)). These conditions are always assumed to hold without further notice when the VECM (7.2.1) is considered in this chapter.

For estimation purposes, we assume that a sample $y_1,\dots,y_T$ and the needed presample values are available. It is then often convenient to write the VECM (7.2.1), for $t = 1,\dots,T$, in matrix notation as

$$\Delta Y = \Pi Y_{-1} + \Gamma\Delta X + U, \qquad (7.2.3)$$

where

$$\Delta Y := [\Delta y_1,\dots,\Delta y_T],\qquad Y_{-1} := [y_0,\dots,y_{T-1}],\qquad \Gamma := [\Gamma_1,\dots,\Gamma_{p-1}],$$

$$\Delta X := [\Delta X_0,\dots,\Delta X_{T-1}] \quad\text{with}\quad \Delta X_{t-1} := \begin{bmatrix}\Delta y_{t-1}\\ \vdots\\ \Delta y_{t-p+1}\end{bmatrix},$$

and

$$U := [u_1,\dots,u_T].$$

We will now consider LS, EGLS, and ML estimation of the parameters of this model. Estimation of the parameters of the corresponding levels VAR form will also be discussed and, moreover, we comment on the implications of including deterministic terms.




7.2.1  LS  Estimation 


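From the matrix version (7.2.3) of our VECM, the LS estimator is seen to be

$$[\hat\Pi : \hat\Gamma] = [\Delta YY_{-1}' : \Delta Y\Delta X']\begin{bmatrix}Y_{-1}Y_{-1}' & Y_{-1}\Delta X'\\ \Delta XY_{-1}' & \Delta X\Delta X'\end{bmatrix}^{-1} \qquad (7.2.4)$$

using the usual formulas from Chapter 3. The corresponding white noise covariance matrix estimator is

$$\hat\Sigma_u := (T - Kp)^{-1}(\Delta Y - \hat\Pi Y_{-1} - \hat\Gamma\Delta X)(\Delta Y - \hat\Pi Y_{-1} - \hat\Gamma\Delta X)'. \qquad (7.2.5)$$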
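The formulas (7.2.4) and (7.2.5) translate almost directly into matrix code. The following Python sketch (NumPy assumed) is an illustration only: the helper names `simulate_vecm`, `vecm_data`, and `ls_vecm`, the bivariate parameter values, the Gaussian errors with identity covariance, and the sample size are all assumptions made for the example and do not come from the text.

```python
import numpy as np

def simulate_vecm(alpha, beta, Gamma1, T, seed=0):
    """Simulate a VECM with one lagged difference and u_t ~ N(0, I_K).

    Returns a (K, T + 2) array; the first two columns are zero presample values.
    """
    rng = np.random.default_rng(seed)
    K = alpha.shape[0]
    Pi = alpha @ beta.T
    y = np.zeros((K, T + 2))
    for s in range(2, T + 2):
        dy = Pi @ y[:, s - 1] + Gamma1 @ (y[:, s - 1] - y[:, s - 2]) + rng.standard_normal(K)
        y[:, s] = y[:, s - 1] + dy
    return y

def vecm_data(y, p):
    """Build Delta Y, Y_{-1}, and Delta X of (7.2.3) from a (K, T + p) data array."""
    n = y.shape[1]
    dy = np.diff(y, axis=1)
    dY = dy[:, p - 1:]                       # Delta y_1, ..., Delta y_T
    Y_lag = y[:, p - 1:n - 1]                # y_0, ..., y_{T-1}
    lags = [dy[:, p - 1 - j:n - 1 - j] for j in range(1, p)]
    dX = np.vstack(lags) if lags else np.empty((0, n - p))
    return dY, Y_lag, dX

def ls_vecm(y, p):
    """LS estimators of Pi, Gamma as in (7.2.4) and of Sigma_u as in (7.2.5)."""
    K = y.shape[0]
    dY, Y_lag, dX = vecm_data(y, p)
    T = dY.shape[1]
    Z = np.vstack([Y_lag, dX])               # all regressors stacked
    coef = dY @ Z.T @ np.linalg.inv(Z @ Z.T)
    resid = dY - coef @ Z
    Sigma_u = resid @ resid.T / (T - K * p)
    return coef[:, :K], coef[:, K:], Sigma_u

alpha0 = np.array([[-0.2], [0.1]])           # hypothetical loading matrix
beta0 = np.array([[1.0], [-1.0]])            # hypothetical cointegration vector
Gamma0 = 0.2 * np.eye(2)                     # hypothetical short-run matrix
y = simulate_vecm(alpha0, beta0, Gamma0, T=5000)
Pi_hat, Gamma_hat, Sigma_hat = ls_vecm(y, p=2)
print(Pi_hat)
```

With a sample this long, `Pi_hat` should be close to `alpha0 @ beta0.T` even though no rank restriction is imposed.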


The asymptotic properties of these estimators are given in the next proposition.


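Proposition 7.1 (Asymptotic Properties of the LS Estimator for a VECM)
Consider the VECM (7.2.1). The LS estimator given in (7.2.4) is consistent and

$$\sqrt{T}\,\mathrm{vec}([\hat\Pi : \hat\Gamma] - [\Pi : \Gamma]) \xrightarrow{d} \mathcal{N}(0, \Sigma_{co}),$$

where

$$\Sigma_{co} = \left(\begin{bmatrix}\beta & 0\\ 0 & I_{Kp-K}\end{bmatrix}\Omega^{-1}\begin{bmatrix}\beta' & 0\\ 0 & I_{Kp-K}\end{bmatrix}\right)\otimes\Sigma_u \qquad (7.2.6)$$

and

$$\Omega = \mathrm{plim}\ \frac{1}{T}\begin{bmatrix}\beta'Y_{-1}Y_{-1}'\beta & \beta'Y_{-1}\Delta X'\\ \Delta XY_{-1}'\beta & \Delta X\Delta X'\end{bmatrix}.$$

The matrix

$$\begin{bmatrix}\beta & 0\\ 0 & I_{Kp-K}\end{bmatrix}\Omega^{-1}\begin{bmatrix}\beta' & 0\\ 0 & I_{Kp-K}\end{bmatrix}$$

is consistently estimated by

$$T\begin{bmatrix}Y_{-1}Y_{-1}' & Y_{-1}\Delta X'\\ \Delta XY_{-1}' & \Delta X\Delta X'\end{bmatrix}^{-1}$$

and $\hat\Sigma_u$ is a consistent estimator for $\Sigma_u$. ■

This proposition generalizes Result 2 of Section 7.1. Therefore similar remarks can be made.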


Remark 1 The covariance matrix $\Sigma_{co}$ is singular. This property is easily seen by noting that $\Omega$ is a $[(Kp - K + r) \times (Kp - K + r)]$ matrix. Thus, the rank of the $(K^2p \times K^2p)$ matrix $\Sigma_{co}$ cannot be greater than $K(Kp - K + r)$, which is smaller than $K^2p$ under our assumption that $r < K$. Still, t-ratios can be set up and interpreted in the usual way because they have standard normal limiting distributions under our assumptions. In contrast, Wald tests and the corresponding F-tests of linear restrictions on the parameters may not have the usual asymptotic $\chi^2$- or approximate F-distributions that are obtained for stationary processes. A more detailed discussion of this issue will be given in Section 7.6. ■


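Remark 2 If $\beta$ is known, the LS estimator

$$[\hat\alpha : \hat\Gamma] = [\Delta YY_{-1}'\beta : \Delta Y\Delta X']\begin{bmatrix}\beta'Y_{-1}Y_{-1}'\beta & \beta'Y_{-1}\Delta X'\\ \Delta XY_{-1}'\beta & \Delta X\Delta X'\end{bmatrix}^{-1} \qquad (7.2.7)$$

of $[\alpha : \Gamma]$ may be considered. Using standard arguments for stationary processes, its asymptotic distribution is seen to be

$$\sqrt{T}\,\mathrm{vec}([\hat\alpha : \hat\Gamma] - [\alpha : \Gamma]) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\alpha,\Gamma}), \qquad (7.2.8)$$

where

$$\Sigma_{\alpha,\Gamma} = \Omega^{-1}\otimes\Sigma_u = \mathrm{plim}\ T\begin{bmatrix}\beta'Y_{-1}Y_{-1}'\beta & \beta'Y_{-1}\Delta X'\\ \Delta XY_{-1}'\beta & \Delta X\Delta X'\end{bmatrix}^{-1}\otimes\Sigma_u.$$

The asymptotic distribution in (7.2.8) is nonsingular so that, for given $\beta$, asymptotic inference for $\alpha$ and $\Gamma$ is standard. Noting that

$$[\hat\alpha\beta' : \hat\Gamma] - [\Pi : \Gamma] = ([\hat\alpha : \hat\Gamma] - [\alpha : \Gamma])\begin{bmatrix}\beta' & 0\\ 0 & I_{Kp-K}\end{bmatrix},$$

it is easy to see that

$$\sqrt{T}\,\mathrm{vec}([\hat\alpha\beta' : \hat\Gamma] - [\Pi : \Gamma])$$

has the same asymptotic distribution as the LS estimator in Proposition 7.1. This finding corresponds to Result 3 in Section 7.1. It means that whether the cointegrating matrix $\beta$ is known or estimated is of no consequence for the asymptotic distribution of the LS estimators of $\Pi$ and $\Gamma$. The reason is that $\beta$ is estimated "superconsistently" even if LS estimation is used. This point will be discussed further in Section 7.2.2. ■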


Remark 3 If the cointegrating rank $r = 0$ and, thus, $\Pi = 0$,

$$\sqrt{T}(\hat\Pi - \Pi) = o_p(1),$$

that is, the LS estimator of $\Pi$ converges faster than with the usual rate $\sqrt{T}$. Therefore, Proposition 7.1 remains valid in the sense that all parts of the asymptotic covariance matrix in (7.2.6) related to $\Pi$ have to be set to zero. In other words, the first $K^2$ rows and columns of $\Sigma_{co}$ are zero. ■

Remark 4 From Proposition 7.1 it is also easy to derive the asymptotic distribution of the LS estimator for the parameters of the levels VAR form corresponding to our VECM,

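$$y_t = A_1y_{t-1} + \cdots + A_py_{t-p} + u_t. \qquad (7.2.9)$$

The $A_j$'s are related to the VECM parameters by

$$\begin{aligned}
A_1 &= \Pi + I_K + \Gamma_1,\\
A_i &= \Gamma_i - \Gamma_{i-1},\qquad i = 2,\dots,p-1,\\
A_p &= -\Gamma_{p-1}
\end{aligned} \qquad (7.2.10)$$

(see also (6.3.7)). Hence, they are obtained by a linear transformation,

$$A := [A_1 : \cdots : A_p] = [\Pi : \Gamma]W + J, \qquad (7.2.11)$$

where

$$J := [I_K : 0 : \cdots : 0] \qquad (K \times Kp)$$

and

$$W := \begin{bmatrix}
I_K & 0 & 0 & \cdots & 0\\
I_K & -I_K & 0 & \cdots & 0\\
0 & I_K & -I_K & & 0\\
\vdots & & \ddots & \ddots & \vdots\\
0 & 0 & \cdots & I_K & -I_K
\end{bmatrix}.$$

Consequently, using

$$\mathrm{vec}([\Pi : \Gamma]W) = (W' \otimes I_K)\,\mathrm{vec}[\Pi : \Gamma],$$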

we get the following implication of Proposition 7.1 (see also Sims, Stock & Watson (1990)).

Corollary 7.1.1

Under the conditions of Proposition 7.1,

$$\sqrt{T}\,\mathrm{vec}(\hat A - A) \xrightarrow{d} \mathcal{N}(0, \Sigma_\alpha^{co}),$$

where $\Sigma_\alpha^{co} = (W' \otimes I_K)\Sigma_{co}(W \otimes I_K)$. Furthermore,

$$\hat\Sigma_\alpha^{co} = (XX')^{-1}\otimes[(Y - \hat AX)(Y - \hat AX)']$$

is a consistent estimator of $\Sigma_\alpha^{co}$. Here $Y := [y_1,\dots,y_T]$ and

$$X := [Y_0,\dots,Y_{T-1}] \quad\text{with}\quad Y_{t-1} := \begin{bmatrix}y_{t-1}\\ \vdots\\ y_{t-p}\end{bmatrix}.$$

Because $\Sigma_\alpha^{co}$ is singular, $\hat A$ also has a singular asymptotic distribution. The distribution in Corollary 7.1.1 remains, in fact, valid if $r = 0$. ■
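The linear map from the VECM parameters to the levels VAR coefficients in (7.2.10) and (7.2.11) can be checked numerically. The following Python sketch (NumPy assumed; the function name `vecm_to_var` and the random test matrices are illustrative assumptions) builds $W$ and $J$ explicitly and returns the blocks $A_1,\dots,A_p$.

```python
import numpy as np

def vecm_to_var(Pi, Gammas):
    """Return [A_1, ..., A_p] from Pi and [Gamma_1, ..., Gamma_{p-1}] via A = [Pi : Gamma] W + J."""
    K = Pi.shape[0]
    p = len(Gammas) + 1
    I = np.eye(K)
    W = np.zeros((K * p, K * p))
    W[:K, :K] = I                                       # first block row: [I_K 0 ... 0]
    for i in range(1, p):
        W[i * K:(i + 1) * K, (i - 1) * K:i * K] = I     # I_K to the left of the diagonal
        W[i * K:(i + 1) * K, i * K:(i + 1) * K] = -I    # -I_K on the diagonal
    J = np.hstack([I, np.zeros((K, K * (p - 1)))])
    A = np.hstack([Pi] + list(Gammas)) @ W + J
    return [A[:, i * K:(i + 1) * K] for i in range(p)]

rng = np.random.default_rng(1)
Pi = rng.standard_normal((3, 3))
G1, G2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A1, A2, A3 = vecm_to_var(Pi, [G1, G2])
print(np.allclose(A1, Pi + np.eye(3) + G1))
```

Because the map is linear, the same `W` can be reused to transform an estimated covariance matrix of the VECM coefficients into one for the levels coefficients.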


Discussion  of  the  Proof  of  Proposition  7.1 


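The proof of Proposition 7.1 is a generalization of that of Result 2 in Section 7.1. Multiplying

$$\begin{bmatrix}y_t\\ \Delta X_t\end{bmatrix}$$

by

$$Q^* := \begin{bmatrix}\beta' & 0\\ 0 & I_{K(p-1)}\\ \alpha_\perp' & 0\end{bmatrix}$$

gives a process

$$z_t = \begin{bmatrix}z_t^{(1)}\\ z_t^{(2)}\end{bmatrix} := Q^*\begin{bmatrix}y_t\\ \Delta X_t\end{bmatrix},$$

where

$$z_t^{(1)} := \begin{bmatrix}\beta'y_t\\ \Delta X_t\end{bmatrix} \qquad (7.2.12)$$

contains $I(0)$ components only and $z_t^{(2)} := \alpha_\perp'y_t$ consists of $I(1)$ components (see Proposition 6.1). Therefore, a lemma analogous to Lemma 7.1 can be established and used to prove Proposition 7.1. We leave the details as an exercise (see Problem 7.2).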

In fact, via the process $z_t$, we can get the following useful lemma from standard weak laws of large numbers and central limit theorems for stationary processes (see Appendix C.4) as well as Proposition C.18 of Appendix C. It summarizes a number of convergence results for variables generated by the VECM (7.2.1). Some of these or similar results were derived by different authors including Phillips & Durlauf (1986), Johansen (1988), Ahn & Reinsel (1990) and Park & Phillips (1989).

Lemma 7.2

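(1) $\Delta X\Delta X' = O_p(T)$ and $(T^{-1}\Delta X\Delta X')^{-1} = O_p(1)$.

(2) $\beta'Y_{-1}\Delta X' = O_p(T)$.

(3) $\beta'Y_{-1}Y_{-1}'\beta = O_p(T)$ and $(T^{-1}\beta'Y_{-1}Y_{-1}'\beta)^{-1} = O_p(1)$.

(4) $\beta'Y_{-1}U' = O_p(T^{1/2})$.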

(5) $\Delta XU' = O_p(T^{1/2})$.

(6) $Y_{-1}U' = O_p(T)$.

(7) $Y_{-1}\Delta X' = O_p(T)$.

(8) $\beta'Y_{-1}Y_{-1}' = O_p(T)$.

(9) $Y_{-1}Y_{-1}' = O_p(T^2)$.


Some of these results are helpful in deriving Proposition 7.1 and they are also useful in proving the next propositions. Because $\Delta Y$, $\beta'Y_{-1}$, and $\Delta X$ contain $I(0)$ variables only, essentially the same results as in the stable case hold for these quantities. This is reflected in Lemma 7.2(1)-(5). On the other hand, $Y_{-1}$ contains $I(1)$ variables that behave differently from $I(0)$ variables. For instance, for a stable process, $Y_{-1}Y_{-1}'/T$ has a fixed probability limit (see Chapter 3). Now the corresponding quantity $Y_{-1}Y_{-1}'$ is $O_p(T^2)$. Intuitively, the reason is that integrated variables do not fluctuate around a constant mean but are trending. Thus, the sums of products and cross-products go to infinity (or minus infinity) more rapidly than for stable processes.


7.2.2  EGLS  Estimation  of  the  Cointegration  Parameters 

For GLS estimation we assume that $\beta$ is normalized as in (7.1.10),

$$\beta = \begin{bmatrix}I_r\\ \beta_{(K-r)}\end{bmatrix}.$$

Because we are primarily interested in estimating $\beta_{(K-r)}$, we concentrate on the error correction term and replace the short-run parameters $\Gamma$ by their LS estimators for a given matrix $\Pi$,

$$\hat\Gamma(\Pi) = (\Delta Y - \Pi Y_{-1})\Delta X'(\Delta X\Delta X')^{-1}.$$

Hence,

$$\Delta Y = \Pi Y_{-1} + (\Delta Y - \Pi Y_{-1})\Delta X'(\Delta X\Delta X')^{-1}\Delta X + U^*.$$

Rearranging terms and defining the $(T \times T)$ matrix

$$M := I_T - \Delta X'(\Delta X\Delta X')^{-1}\Delta X,$$

gives




$$R_0 = \Pi R_1 + U^* = \alpha\beta'R_1 + U^*, \qquad (7.2.13)$$

where

$$R_0 := \Delta YM \quad\text{and}\quad R_1 := Y_{-1}M.$$

Notice that $R_0$ is just the residual matrix from a (multivariate) regression of $\Delta y_t$ on $\Delta X_{t-1}$ and $R_1$ is the matrix of residuals from a regression of $y_{t-1}$ on $\Delta X_{t-1}$. Denoting the first $r$ and last $K - r$ rows of $R_1$ by $R_1^{(1)}$ and $R_1^{(2)}$, respectively, and using the normalization of $\beta$, (7.2.13) can be rewritten as

$$R_0 - \alpha R_1^{(1)} = \alpha\beta'_{(K-r)}R_1^{(2)} + U^*. \qquad (7.2.14)$$

Based on this "concentrated model" the GLS estimator of $\beta'_{(K-r)}$ is

$$\hat\beta'_{(K-r)} = (\alpha'\Sigma_u^{-1}\alpha)^{-1}\alpha'\Sigma_u^{-1}(R_0 - \alpha R_1^{(1)})R_1^{(2)\prime}\big(R_1^{(2)}R_1^{(2)\prime}\big)^{-1} \qquad (7.2.15)$$

(see Eq. (7.1.12)). Note that the same estimator is obtained if the short-run parameters are not concentrated out first because $\Gamma$ has been replaced by the optimal matrix for any given matrix $\Pi$. As in the simple special case model considered in Section 7.1, it is now obvious how to obtain a feasible GLS estimator. In a first estimation round we determine the LS estimator $[\hat\Pi : \hat\Gamma]$ of $[\Pi : \Gamma]$ as in (7.2.4) and $\hat\Sigma_u$ as in (7.2.5). Using the first $r$ columns of $\hat\Pi$ as an estimator $\hat\alpha$, we get the EGLS estimator

$$\hat{\hat\beta}'_{(K-r)} = (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1}\hat\alpha'\hat\Sigma_u^{-1}(R_0 - \hat\alpha R_1^{(1)})R_1^{(2)\prime}\big(R_1^{(2)}R_1^{(2)\prime}\big)^{-1}. \qquad (7.2.16)$$

This estimator was proposed by Ahn & Reinsel (1990) and Saikkonen (1992) (see also Reinsel (1993, p. 171)). Its asymptotic properties are analogous to those of the EGLS estimator for the simple model considered in Section 7.1. They are summarized in the following proposition which was proven by Ahn & Reinsel (1990).

Proposition 7.2 (Asymptotic Properties of the EGLS Estimator for the Cointegration Matrix)

Consider the VECM (7.2.1) with cointegration matrix $\beta$ normalized as in (7.1.10). Suppose $\hat\alpha$ and $\hat\Sigma_u$ are consistent estimators of $\alpha$ and $\Sigma_u$, respectively. Then the EGLS estimator of $\beta'_{(K-r)}$ given in (7.2.16) has the following asymptotic distribution:

$$T(\hat{\hat\beta}'_{(K-r)} - \beta'_{(K-r)}) \xrightarrow{d} \Big(\int_0^1 W^{\#}_{K-r}\,dW_r^{*\prime}\Big)'\Big(\int_0^1 W^{\#}_{K-r}W^{\#\prime}_{K-r}\,ds\Big)^{-1}, \qquad (7.2.17)$$

where $W^{\#}_{K-r}$ and $W_r^{*}$ are suitable independent $(K - r)$- and $r$-dimensional Wiener processes, respectively, whose parameters depend on those of the VECM. Furthermore,

$$\mathrm{vec}\Big[(\hat{\hat\beta}'_{(K-r)} - \beta'_{(K-r)})\big(R_1^{(2)}R_1^{(2)\prime}\big)^{1/2}\Big] \xrightarrow{d} \mathcal{N}\big(0,\ I_{K-r}\otimes(\alpha'\Sigma_u^{-1}\alpha)^{-1}\big). \qquad (7.2.18)$$


Remark 1 The EGLS estimator has the same asymptotic distribution as the GLS estimator. Moreover, it has the same asymptotic distribution one would obtain if all parameters ($\alpha$, $\Gamma$, and $\Sigma_u$) except $\beta_{(K-r)}$ were known. It converges at rate $T$. Hence, $\hat{\hat\beta}_{(K-r)}$ is a superconsistent estimator of $\beta_{(K-r)}$ and, thus,

$$\hat{\hat\beta} = \begin{bmatrix}I_r\\ \hat{\hat\beta}_{(K-r)}\end{bmatrix}$$

is a superconsistent estimator of $\beta$. The precise form of the Wiener processes $W_r^{*}$ and $W^{\#}_{K-r}$ depends on the short-run dynamics of the process $y_t$. It is given, for example, in Ahn & Reinsel (1990). ■

Remark 2 The matrix

$$\begin{aligned}
T^{-2}R_1R_1' &= T^{-2}Y_{-1}MY_{-1}'\\
&= T^{-2}Y_{-1}Y_{-1}' - T^{-2}Y_{-1}\Delta X'(T^{-1}\Delta X\Delta X')^{-1}T^{-1}\Delta XY_{-1}'\\
&= T^{-2}Y_{-1}Y_{-1}' + o_p(1)O_p(1)O_p(1)\\
&= T^{-2}Y_{-1}Y_{-1}' + o_p(1),
\end{aligned}$$

where Lemma 7.2(1) and (7) have been used. This result implies that (7.2.18) could be stated alternatively as

$$\mathrm{vec}\Big[(\hat{\hat\beta}'_{(K-r)} - \beta'_{(K-r)})\big(Y_{-1}^{(2)}Y_{-1}^{(2)\prime}\big)^{1/2}\Big] \xrightarrow{d} \mathcal{N}\big(0,\ I_{K-r}\otimes(\alpha'\Sigma_u^{-1}\alpha)^{-1}\big),$$

where $Y_{-1}^{(2)}$ contains the last $K - r$ rows of $Y_{-1}$. For practical purposes, the result as stated in (7.2.18) is more useful because it can be used directly for setting up meaningful t-ratios and Wald or F-tests for hypotheses about the coefficients of $\beta_{(K-r)}$. These quantities have the usual asymptotic or approximate distributions. Of course, the same is true if $(R_1^{(2)}R_1^{(2)\prime})^{1/2}$ is replaced by $(Y_{-1}^{(2)}Y_{-1}^{(2)\prime})^{1/2}$. Still, in small samples it is advantageous to take the short-run dynamics into account as in $(R_1^{(2)}R_1^{(2)\prime})^{1/2}$. ■

Remark 3 It is also possible to replace $\beta$ in $\Pi = \alpha\beta'$ in (7.2.3) by the EGLS estimator and estimate the other parameters by LS from the model

$$\Delta Y = \alpha\hat{\hat\beta}'Y_{-1} + \Gamma\Delta X + U^*.$$

The resulting estimator $[\hat{\hat\alpha} : \hat{\hat\Gamma}]$ has the same asymptotic properties as $[\hat\alpha : \hat\Gamma]$ in (7.2.7) which is based on a known $\beta$. As a consequence, $[\hat{\hat\alpha}\hat{\hat\beta}' : \hat{\hat\Gamma}]$ also has the same asymptotic properties as $[\hat\alpha\beta' : \hat\Gamma]$. ■

Remark 4 The EGLS estimator was actually presented in a slightly different form by Ahn & Reinsel (1990) and Saikkonen (1992). These authors use the representation

$$\hat{\hat\beta}'_{(K-r)} = (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1}\hat\alpha'\hat\Sigma_u^{-1}\hat\Pi_2,$$

where $\hat\Pi_2$ is the $(K \times (K - r))$ matrix of the last $K - r$ columns of the LS estimator $\hat\Pi$ of $\Pi$ (see Reinsel (1993, p. 171) for a discussion of the equivalence of this estimator and the EGLS estimator (7.2.16)). ■
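The two-step EGLS procedure of this subsection can be sketched in a few lines. The Python code below (NumPy assumed) is an illustration under stated assumptions: it fixes $p = 2$, uses a hypothetical bivariate data generating process with $\beta = (1, -1)'$, and the helper names `simulate_vecm`, `partial_out`, and `egls_beta` are invented for the example. The concentration step avoids forming the $(T \times T)$ matrix $M$ explicitly.

```python
import numpy as np

def simulate_vecm(alpha, beta, Gamma1, T, seed=0):
    """Simulate a VECM with one lagged difference and u_t ~ N(0, I_K); returns (K, T + 2)."""
    rng = np.random.default_rng(seed)
    K = alpha.shape[0]
    Pi = alpha @ beta.T
    y = np.zeros((K, T + 2))
    for s in range(2, T + 2):
        dy = Pi @ y[:, s - 1] + Gamma1 @ (y[:, s - 1] - y[:, s - 2]) + rng.standard_normal(K)
        y[:, s] = y[:, s - 1] + dy
    return y

def partial_out(A, dX):
    """Residuals from regressing the rows of A on dX, i.e. the product A M."""
    return A - A @ dX.T @ np.linalg.solve(dX @ dX.T, dX)

def egls_beta(y, r):
    """EGLS estimator of beta'_{(K-r)} as in (7.2.16), for a VECM with p = 2."""
    K = y.shape[0]
    dy = np.diff(y, axis=1)
    dY, Y_lag, dX = dy[:, 1:], y[:, 1:-1], dy[:, :-1]
    T = dY.shape[1]
    R0, R1 = partial_out(dY, dX), partial_out(Y_lag, dX)
    Pi_hat = R0 @ R1.T @ np.linalg.inv(R1 @ R1.T)     # first-round LS estimator of Pi
    U = R0 - Pi_hat @ R1
    Sigma_u = U @ U.T / (T - K * 2)                   # as in (7.2.5) with p = 2
    alpha = Pi_hat[:, :r]                             # first r columns of Pi_hat
    R11, R12 = R1[:r], R1[r:]
    SiA = np.linalg.solve(Sigma_u, alpha)             # Sigma_u^{-1} alpha
    left = np.linalg.inv(alpha.T @ SiA) @ SiA.T       # (a' S^{-1} a)^{-1} a' S^{-1}
    return left @ (R0 - alpha @ R11) @ R12.T @ np.linalg.inv(R12 @ R12.T)

alpha0 = np.array([[-0.2], [0.1]])                    # hypothetical parameter values
beta0 = np.array([[1.0], [-1.0]])
y = simulate_vecm(alpha0, beta0, 0.2 * np.eye(2), T=5000)
beta_Kr = egls_beta(y, r=1)
print(beta_Kr)
```

Because the estimator converges at rate $T$, the printed value should be very close to the true $\beta_{(K-r)} = -1$ of the simulated process.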


7.2.3  ML  Estimation 

If the process $y_t$ is Gaussian or, equivalently, $u_t \sim \mathcal{N}(0, \Sigma_u)$, the VECM (7.2.1) can be estimated by maximum likelihood (ML) taking also the rank restriction for $\Pi = \alpha\beta'$ into account (see Johansen (1988, 1995)). The log-likelihood function for a sample of size $T$ is

$$\ln l = -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\mathrm{tr}\big[(\Delta Y - \alpha\beta'Y_{-1} - \Gamma\Delta X)'\Sigma_u^{-1}(\Delta Y - \alpha\beta'Y_{-1} - \Gamma\Delta X)\big]. \qquad (7.2.19)$$

In the following, we will first discuss the computation of the estimators and then consider their asymptotic properties.


The  Estimator 

For ML estimation we do not assume that $\beta$ is normalized. We only make the assumption $\mathrm{rk}(\Pi) = r$ which implies that the matrix can be represented as $\Pi = \alpha\beta'$, where $\alpha$ and $\beta$ are $(K \times r)$ with $\mathrm{rk}(\alpha) = \mathrm{rk}(\beta) = r$. In the next proposition the ML estimators are given. The proposition generalizes the special case estimators given in (7.1.21) and (7.1.22).

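Proposition 7.3 (ML Estimators of a VECM)

Let $M := I_T - \Delta X'(\Delta X\Delta X')^{-1}\Delta X$, $R_0 := \Delta YM$ and $R_1 := Y_{-1}M$, as before, and define

$$S_{ij} := R_iR_j'/T, \qquad i, j = 0, 1,$$

$\lambda_1 \geq \cdots \geq \lambda_K$ are the eigenvalues of $S_{11}^{-1/2}S_{10}S_{00}^{-1}S_{01}S_{11}^{-1/2}$, and $v_1,\dots,v_K$ are the corresponding orthonormal eigenvectors.

The log-likelihood function in (7.2.19) is maximized for

$$\begin{aligned}
\beta' &= \tilde\beta' := [v_1,\dots,v_r]'S_{11}^{-1/2},\\
\alpha &= \tilde\alpha := \Delta YMY_{-1}'\tilde\beta(\tilde\beta'Y_{-1}MY_{-1}'\tilde\beta)^{-1} = S_{01}\tilde\beta(\tilde\beta'S_{11}\tilde\beta)^{-1},\\
\Gamma &= \tilde\Gamma := (\Delta Y - \tilde\alpha\tilde\beta'Y_{-1})\Delta X'(\Delta X\Delta X')^{-1},\\
\Sigma_u &= \tilde\Sigma_u := (\Delta Y - \tilde\alpha\tilde\beta'Y_{-1} - \tilde\Gamma\Delta X)(\Delta Y - \tilde\alpha\tilde\beta'Y_{-1} - \tilde\Gamma\Delta X)'/T.
\end{aligned}$$

The maximum is

$$\max \ln l = -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\Big[\ln|S_{00}| + \sum_{i=1}^r \ln(1 - \lambda_i)\Big] - \frac{KT}{2}. \qquad (7.2.20)$$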

Proof: From Chapter 3, Section 3.4, it is known that for any fixed $\alpha$ and $\beta$ the maximum of $\ln l$ is attained for

$$\hat\Gamma(\alpha\beta') = (\Delta Y - \alpha\beta'Y_{-1})\Delta X'(\Delta X\Delta X')^{-1}.$$

Thus, we replace $\Gamma$ in (7.2.19) by $\hat\Gamma(\alpha\beta')$ and get the concentrated log-likelihood

$$-\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln|\Sigma_u| - \frac{1}{2}\mathrm{tr}\big[(\Delta YM - \alpha\beta'Y_{-1}M)'\Sigma_u^{-1}(\Delta YM - \alpha\beta'Y_{-1}M)\big].$$

Hence, we just have to maximize this expression with respect to $\alpha$, $\beta$, and $\Sigma_u$. We also know from Chapter 3 that, for given $\alpha$ and $\beta$, the maximum is attained if

$$\hat\Sigma_u(\alpha\beta') = (\Delta YM - \alpha\beta'Y_{-1}M)(\Delta YM - \alpha\beta'Y_{-1}M)'/T$$

is substituted for $\Sigma_u$. Consequently, we have to maximize

$$-\frac{T}{2}\ln\big|(\Delta YM - \alpha\beta'Y_{-1}M)(\Delta YM - \alpha\beta'Y_{-1}M)'/T\big|$$

or, equivalently, minimize the determinant with respect to $\alpha$ and $\beta$. Thus, all results of Proposition 7.3 follow from Proposition A.7 of Appendix A.14. ■

The solutions $\tilde\beta$ and $\tilde\alpha$ of the optimization problem given in the proposition are not unique because, for any nonsingular $(r \times r)$ matrix $Q$, $\tilde\alpha Q^{-1}$ and $\tilde\beta Q'$ represent another set of ML estimators for $\alpha$ and $\beta$. However, the proposition shows that explicit expressions for ML estimators are available. If $r = K$, the proposition still remains valid. Also, ML estimators for the levels VAR representation corresponding to the VECM (7.2.1) observing the rank restriction are readily available via the relations in (7.2.10).
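The closed-form ML solution is a reduced rank regression and can be sketched as follows. The Python code (NumPy assumed) fixes $p = 2$ and uses a hypothetical bivariate data generating process; the names `simulate_vecm` and `ml_vecm` and all parameter values are assumptions made for the illustration. The function also evaluates the maximized log-likelihood via (7.2.20), which can be compared with a direct evaluation of the Gaussian log-likelihood at the estimates.

```python
import numpy as np

def simulate_vecm(alpha, beta, Gamma1, T, seed=0):
    """Simulate a VECM with one lagged difference and u_t ~ N(0, I_K); returns (K, T + 2)."""
    rng = np.random.default_rng(seed)
    K = alpha.shape[0]
    Pi = alpha @ beta.T
    y = np.zeros((K, T + 2))
    for s in range(2, T + 2):
        dy = Pi @ y[:, s - 1] + Gamma1 @ (y[:, s - 1] - y[:, s - 2]) + rng.standard_normal(K)
        y[:, s] = y[:, s - 1] + dy
    return y

def ml_vecm(y, r):
    """Reduced rank ML estimators in the spirit of Proposition 7.3, for p = 2."""
    K = y.shape[0]
    dy = np.diff(y, axis=1)
    dY, Y_lag, dX = dy[:, 1:], y[:, 1:-1], dy[:, :-1]
    T = dY.shape[1]
    proj = lambda A: A - A @ dX.T @ np.linalg.solve(dX @ dX.T, dX)   # A M
    R0, R1 = proj(dY), proj(Y_lag)
    S00, S01, S11 = R0 @ R0.T / T, R0 @ R1.T / T, R1 @ R1.T / T
    w, Q = np.linalg.eigh(S11)
    S11_mh = Q @ np.diag(w ** -0.5) @ Q.T                # symmetric S11^{-1/2}
    A = S11_mh @ S01.T @ np.linalg.solve(S00, S01) @ S11_mh
    lam, V = np.linalg.eigh((A + A.T) / 2)               # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]                       # reorder: largest first
    beta = S11_mh @ V[:, :r]                             # beta' = [v_1..v_r]' S11^{-1/2}
    alpha = S01 @ beta @ np.linalg.inv(beta.T @ S11 @ beta)
    Gamma = (dY - alpha @ beta.T @ Y_lag) @ dX.T @ np.linalg.inv(dX @ dX.T)
    resid = dY - alpha @ beta.T @ Y_lag - Gamma @ dX
    Sigma_u = resid @ resid.T / T
    max_ll = (-K * T / 2 * np.log(2 * np.pi)
              - T / 2 * (np.linalg.slogdet(S00)[1] + np.log(1 - lam[:r]).sum())
              - K * T / 2)                               # formula (7.2.20)
    return alpha, beta, Gamma, Sigma_u, lam, max_ll

alpha0 = np.array([[-0.2], [0.1]])                       # hypothetical parameter values
beta0 = np.array([[1.0], [-1.0]])
y = simulate_vecm(alpha0, beta0, 0.2 * np.eye(2), T=5000)
alpha_ml, beta_ml, Gamma_ml, Sigma_ml, lam, max_ll = ml_vecm(y, r=1)
print(alpha_ml @ beta_ml.T)
```

Only the product `alpha_ml @ beta_ml.T` is comparable with the true $\Pi$; the individual factors depend on the eigenvector normalization, in line with the non-uniqueness noted above.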

The next question concerns the properties of the ML estimators of a cointegrated system. They are discussed in the following.


Asymptotic  Properties  of  the  ML  Estimator 

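The following proposition generalizes Result 7 of Section 7.1.

Proposition 7.4 (Asymptotic Properties of the ML Estimators of a VECM)
The ML estimators for the VECM (7.2.1) given in Proposition 7.3 have the following asymptotic properties:

$$\sqrt{T}\,\mathrm{vec}([\tilde\alpha\tilde\beta' : \tilde\Gamma] - [\Pi : \Gamma]) \xrightarrow{d} \mathcal{N}(0, \Sigma_{co}), \qquad (7.2.21)$$

where $\Sigma_{co}$ is as defined in Proposition 7.1, and

$$\sqrt{T}\,\mathrm{vech}(\tilde\Sigma_u - \Sigma_u) \xrightarrow{d} \mathcal{N}\big(0,\ 2D_K^{+}(\Sigma_u\otimes\Sigma_u)D_K^{+\prime}\big). \qquad (7.2.22)$$

Furthermore, $\tilde\Sigma_u$ is asymptotically independent of $\tilde\alpha\tilde\beta'$ and $\tilde\Gamma$. Here, as usual, $D_K^{+} = (D_K'D_K)^{-1}D_K'$ and $D_K$ is the $(K^2 \times \frac{1}{2}K(K+1))$ duplication matrix. ■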

Remark 1 It is clear that the ML estimator of $[\Pi : \Gamma]$ must have the same asymptotic distribution as the LS estimator in Proposition 7.1 because the ML estimator with known or given cointegration matrix $\beta$ also has the same asymptotic distribution. The ML estimator $\tilde\alpha\tilde\beta'$ of $\Pi$ in Proposition 7.3 may be viewed as a restricted LS estimator which is not as much restricted as the one with known $\beta$. Thus, the asymptotic result in (7.2.21) is not surprising. A rigorous proof of the result is given in Johansen (1995). ■

Remark 2 The covariance matrix $\Sigma_{co}$ is singular, as noted in Remark 1 for Proposition 7.1. The rank of the $(K^2p \times K^2p)$ matrix $\Sigma_{co}$ cannot be greater than $K(Kp - K + r)$ which is smaller than $K^2p$ if $r < K$. ■

Remark 3 Individually, the matrices $\alpha$ and $\beta$ cannot be estimated consistently without further constraints. Under the assumptions of Proposition 7.4, these matrices are not identified (not unique). If we make specific identifying assumptions in order to obtain unique parameter values and estimators, consistent estimation is possible. For instance, we may use the normalization $\beta' = [I_r : \beta'_{(K-r)}]$.

The ML estimator of $\beta_{(K-r)}$ may be obtained from the ML estimator $\tilde\beta$ of $\beta$ given in Proposition 7.3 by denoting the first $r$ rows of $\tilde\beta$ by $\tilde\beta_{(r)}$ and letting $\breve\beta_{(K-r)}$ consist of the last $K - r$ rows of $\tilde\beta\tilde\beta_{(r)}^{-1}$. This ML estimator has the same asymptotic properties as the EGLS estimator in Proposition 7.2 (see Ahn & Reinsel (1990)). In other words, inference procedures based on the ML estimator can be derived from the result

$$\mathrm{vec}\Big[(\breve\beta'_{(K-r)} - \beta'_{(K-r)})\big(R_1^{(2)}R_1^{(2)\prime}\big)^{1/2}\Big] \xrightarrow{d} \mathcal{N}\big(0,\ I_{K-r}\otimes(\alpha'\Sigma_u^{-1}\alpha)^{-1}\big).$$

It was found in a number of studies that the ML estimator $\breve\beta_{(K-r)}$ may have some undesirable properties in small samples and, in particular, it may produce occasional outlying estimates which are far away from the true parameter values (e.g., Phillips (1994), Hansen, Kim & Mittnik (1998)). This behavior of the estimator is due to the lack of finite sample moments. Brüggemann & Lütkepohl (2004) compared the EGLS and ML estimators in a small Monte Carlo study and found that the EGLS estimator is more robust in this respect.


Remark 4 If $\beta$ is identified, the corresponding ML estimator of $\alpha$ is asymptotically normal, i.e., $\sqrt{T}\,\mathrm{vec}(\tilde\alpha - \alpha)$ converges to the same asymptotic distribution as in Remark 2 for Proposition 7.1. ■

Remark 5 The normality of the process is not essential for the asymptotic properties of the estimators $\tilde\Gamma$ and $\tilde\Pi = \tilde\alpha\tilde\beta'$. Much of Proposition 7.4 holds under weaker conditions when quasi ML estimators based on the Gaussian likelihood function are considered. We have chosen the normality assumption for convenience. ■

Remark 6 The asymptotic distribution of $\tilde\Sigma_u$ may be different if $u_t$ is not Gaussian. The limiting distribution in (7.2.22) is obtained from the following lemma. ■

Lemma 7.3

$$\mathrm{plim}\ \sqrt{T}(\tilde\Sigma_u - UU'/T) = 0.$$

This lemma not only implies consistency of $\tilde\Sigma_u$ but also shows that the asymptotic distribution of

$$\sqrt{T}\,\mathrm{vech}(\tilde\Sigma_u - \Sigma_u)$$

is the same as that of

$$\sqrt{T}\,\mathrm{vech}(T^{-1}UU' - \Sigma_u).$$

In other words, it is independent of the other coefficients of the system and has the form given in (7.2.22) (see also Section 3.4, Proposition 3.4).


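Proof of Lemma 7.3:

$$\begin{aligned}
\tilde\Sigma_u &= T^{-1}(\Delta Y - \tilde\alpha\tilde\beta'Y_{-1} - \tilde\Gamma\Delta X)(\Delta Y - \tilde\alpha\tilde\beta'Y_{-1} - \tilde\Gamma\Delta X)'\\
&= T^{-1}[U + (\Pi - \tilde\alpha\tilde\beta')Y_{-1} + (\Gamma - \tilde\Gamma)\Delta X][U + (\Pi - \tilde\alpha\tilde\beta')Y_{-1} + (\Gamma - \tilde\Gamma)\Delta X]'\\
&= \frac{UU'}{T} + (\Pi - \tilde\alpha\tilde\beta')\frac{Y_{-1}U'}{T} + \frac{UY_{-1}'}{T}(\Pi - \tilde\alpha\tilde\beta')'\\
&\quad + (\Pi - \tilde\alpha\tilde\beta')\frac{Y_{-1}Y_{-1}'}{T}(\Pi - \tilde\alpha\tilde\beta')'\\
&\quad + (\Pi - \tilde\alpha\tilde\beta')\frac{Y_{-1}\Delta X'}{T}(\Gamma - \tilde\Gamma)' + (\Gamma - \tilde\Gamma)\frac{\Delta XY_{-1}'}{T}(\Pi - \tilde\alpha\tilde\beta')'\\
&\quad + (\Gamma - \tilde\Gamma)\frac{\Delta XU'}{T} + \frac{U\Delta X'}{T}(\Gamma - \tilde\Gamma)'\\
&\quad + (\Gamma - \tilde\Gamma)\frac{\Delta X\Delta X'}{T}(\Gamma - \tilde\Gamma)'.
\end{aligned}$$

Using $\tilde\alpha\tilde\beta' - \Pi = O_p(T^{-1/2})$, $\tilde\Gamma - \Gamma = O_p(T^{-1/2})$ and the results in Lemma 7.2, we get

$$\sqrt{T}(\Gamma - \tilde\Gamma)\frac{\Delta X\Delta X'}{T}(\Gamma - \tilde\Gamma)' = o_p(1),$$

$$\sqrt{T}(\Pi - \tilde\alpha\tilde\beta')\frac{Y_{-1}\Delta X'}{T}(\Gamma - \tilde\Gamma)' = o_p(1),$$

and

$$\sqrt{T}(\Pi - \tilde\alpha\tilde\beta')\frac{Y_{-1}Y_{-1}'}{T}(\Pi - \tilde\alpha\tilde\beta')' = o_p(1).$$

Thus, Lemma 7.3 is proven if we can show that

$$\sqrt{T}(\tilde\alpha\tilde\beta' - \Pi)\frac{Y_{-1}U'}{T} = o_p(1). \qquad (7.2.23)$$

To prove this result, we define $\tilde\alpha(\beta)$ to be the ML estimator of $\alpha$ given $\beta$ and note that

$$\sqrt{T}(\tilde\alpha\tilde\beta' - \Pi)\frac{Y_{-1}U'}{T} = \sqrt{T}[\tilde\alpha\tilde\beta' - \tilde\alpha(\beta)\beta']\frac{Y_{-1}U'}{T} + \sqrt{T}[\tilde\alpha(\beta) - \alpha]\frac{\beta'Y_{-1}U'}{T}.$$

This quantity converges to zero in probability by Lemma 7.2(4), the fact that $\sqrt{T}[\tilde\alpha(\beta) - \alpha] = O_p(1)$ (see (7.2.8)) and because $\sqrt{T}[\tilde\alpha\tilde\beta' - \tilde\alpha(\beta)\beta'] = o_p(1)$. We leave the latter result as an exercise (see Problem 7.3). ■

7.2.4 Including Deterministic Terms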


So far we have assumed that there are no deterministic terms in the data generation process, to simplify the exposition. In practice, such terms are typically needed for a proper representation of the data generation process. It turns out, however, that they can be easily accommodated in the estimation procedures for VECMs discussed so far, if the setup of Section 6.4 is used. Suppose the observed process $y_t$ can be represented as

$$y_t = \mu_t + x_t, \qquad (7.2.24)$$

where $x_t$ is a zero mean process with VECM representation as in (7.2.1) and $\mu_t$ stands for the deterministic term. In general, the latter term may consist of polynomial trends, seasonal and other dummy variables as well as constant means. As in Section 6.4, we can then set up the VECM for the observed $y_t$ variables as

$$\begin{aligned}
\Delta y_t &= \alpha[\beta' : \eta']\begin{bmatrix}y_{t-1}\\ D^{co}_{t-1}\end{bmatrix} + \Gamma_1\Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + CD_t + u_t\\
&= \Pi^{+}y^{+}_{t-1} + \Gamma_1\Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1} + CD_t + u_t,
\end{aligned} \qquad (7.2.25)$$

where $D_t^{co}$ contains all the deterministic terms which are present in the cointegration relations, $D_t$ contains all remaining deterministics, and $\eta'$ and $C$ are the corresponding parameter matrices. Moreover, $\Pi^{+} := \alpha[\beta' : \eta'] = \alpha\beta^{+\prime}$ and

$$y^{+}_{t-1} := \begin{bmatrix}y_{t-1}\\ D^{co}_{t-1}\end{bmatrix}.$$

Notice that we assume that a specific deterministic term appears only once, either in $D_t^{co}$ or in $D_t$.

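Now we can simply modify the matrices used for representing the estimators in the previous subsections and then use basically the same formulas as before for computing the estimators. For example, defining

$$Y^{+}_{-1} := [y_0^{+},\dots,y^{+}_{T-1}],\qquad \Gamma^{+} := [\Gamma_1,\dots,\Gamma_{p-1}, C],$$

and

$$\Delta X^{+} := [\Delta X^{+}_0,\dots,\Delta X^{+}_{T-1}] \quad\text{with}\quad \Delta X^{+}_{t-1} := \begin{bmatrix}\Delta y_{t-1}\\ \vdots\\ \Delta y_{t-p+1}\\ D_t\end{bmatrix}$$

gives the LS estimator

$$[\hat\Pi^{+} : \hat\Gamma^{+}] = [\Delta YY^{+\prime}_{-1} : \Delta Y\Delta X^{+\prime}]\begin{bmatrix}Y^{+}_{-1}Y^{+\prime}_{-1} & Y^{+}_{-1}\Delta X^{+\prime}\\ \Delta X^{+}Y^{+\prime}_{-1} & \Delta X^{+}\Delta X^{+\prime}\end{bmatrix}^{-1}.$$

The EGLS or ML estimators may be obtained analogously.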

Hence, the computation of the estimators is as easy as in the case without deterministic terms. Also, the asymptotic properties of the parameter estimators are essentially unchanged. The asymptotic theory for the deterministic terms requires some care, however, because their convergence rates depend on the specific terms included. For instance, if linear trends are included, the convergence rates of the associated slope parameters are different from $\sqrt{T}$. Generally, if the VECM is specified properly, including the cointegrating rank $r$, and if EGLS or ML methods are used, the usual inference methods are available. In particular, likelihood ratio tests for parameter restrictions related to the deterministic terms permit standard $\chi^2$ asymptotics (see, e.g., Johansen (1995)).

A question of interest in this context is, for example, whether a particular deterministic term can indeed be constrained to the cointegration relations or needs to be maintained in unrestricted form in the model. The $i$-th component of $D_t$ can be absorbed in the error correction term if the $i$-th column of the coefficient matrix $C$, denoted by $C_i$, satisfies $C_i = \alpha\eta_i$ for some $r$-dimensional vector $\eta_i$. Thus, the relevant null hypothesis is

$$\alpha_\perp'C_i = 0.$$

In other words, there are $K - r$ restrictions for each component that is confined to the cointegration relations. They are easy to test by a likelihood ratio test because the ML estimators and, hence, the likelihood maxima are easy to obtain for both the restricted and unrestricted model by just specifying the terms in $D_t^{co}$ and $D_t$ accordingly. If $m$ deterministic components are restricted to the cointegration relations, the LR statistic has an asymptotic $\chi^2(m(K - r))$-distribution under our usual assumptions.


7.2.5  Other  Estimation  Methods  for  Cointegrated  Systems 

Some  other  estimation  methods  for  cointegration  relations  and  VECMs  have 
been  proposed  in  the  literature.  For  example,  other  systems  methods  for 
estimating  the  cointegrating  parameters  were  considered  by  Phillips  (1991) 
who  discussed  nonparametric  estimation  of  the  short-run  parameters.  Stock 
&  Watson  (1988)  proposed  an  estimator  based  on  principal  components  and 
Bossaerts  (1988)  used  canonical  correlations.  The  latter  two  estimators  were 
shown  to  be  inferior  to  the  ML  estimators  in  a  small  sample  comparison  by 
Gonzalo  (1994)  and  are  therefore  not  considered  here. 

If there is just a single cointegration relation, it may also be estimated
by single equation LS. Suppose that $\beta$ is normalized as in (7.1.10) such
that $\beta = (1, \beta_2, \dots, \beta_K)'$ and
$\beta' y_t = y_{1t} + \beta_2 y_{2t} + \cdots + \beta_K y_{Kt}$. Hence,

$$y_{1t} = \gamma_2 y_{2t} + \cdots + \gamma_K y_{Kt} + ec_t,$$

where $\gamma_i := -\beta_i$ and $ec_t$ is a stable, stationary process. Defining

$$y_{(1)} := \begin{bmatrix} y_{11} \\ \vdots \\ y_{1T} \end{bmatrix}
\quad\text{and}\quad
Y_{(2)} := \begin{bmatrix} y_{21} & \cdots & y_{K1} \\ \vdots & & \vdots \\ y_{2T} & \cdots & y_{KT} \end{bmatrix},$$

the LS estimator for $\gamma' := (\gamma_2, \dots, \gamma_K)$ is

$$\hat\gamma' = y_{(1)}' Y_{(2)} (Y_{(2)}' Y_{(2)})^{-1}.$$

Stock (1987) showed that $\hat\gamma$ is superconsistent and, more precisely,
$T(\hat\gamma - \gamma)$ converges in distribution. Thus,
$\hat\gamma - \gamma = O_p(T^{-1})$. However, there is some evidence that
$\hat\gamma$ is biased in small samples (Phillips & Hansen (1990)). Therefore,
using LS estimation of the cointegration parameters without any correction
for further dynamics in the model is not recommended.
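The single equation LS estimator can be illustrated with a small simulation. The following Python sketch is not part of the original text: the data generating process in `simulate` (a random walk regressor plus white noise deviations from the cointegration relation), the function names, and the value $\gamma = 2$ are assumptions chosen for illustration. Because the deviations are serially uncorrelated and independent of the regressor here, the sketch shows the fast convergence only and not the small sample bias that arises with richer dynamics.

```python
import numpy as np

def ls_cointegration(y):
    """LS estimate of gamma in y_1t = gamma_2 y_2t + ... + gamma_K y_Kt + ec_t.

    y is a (T x K) array; its first column is the normalized variable.
    """
    y1 = y[:, 0]      # y_(1), length T
    Y2 = y[:, 1:]     # Y_(2), T x (K-1)
    # Solves the normal equations (Y2'Y2) gamma = Y2'y1
    gamma_hat, *_ = np.linalg.lstsq(Y2, y1, rcond=None)
    return gamma_hat

def simulate(T, gamma=2.0, seed=0):
    """Hypothetical bivariate DGP: y_2t is a random walk, y_1t = gamma*y_2t + ec_t."""
    rng = np.random.default_rng(seed)
    y2 = np.cumsum(rng.standard_normal(T))
    ec = rng.standard_normal(T)   # stationary deviation from the relation
    return np.column_stack([gamma * y2 + ec, y2])

# The estimation error should shrink roughly at rate 1/T rather than 1/sqrt(T)
for T in (100, 1000, 10000):
    err = abs(ls_cointegration(simulate(T))[0] - 2.0)
    print(T, err)
```

In an applied setting one would not rely on this estimator alone; the corrections discussed in the following paragraph address the dynamics that this toy example leaves out.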

A large number of single equation estimators for cointegration relations
were reviewed and compared by Caporale & Pittis (2004). In addition to the
simple LS estimator presented in the foregoing, they also considered
estimators which are corrected for short-run dynamics. For example, this may
be accomplished by including leads and lags of the differenced regressor
variables in the estimation equation (e.g., Stock & Watson (1993)) or by
adding also lagged differences of the dependent variable (e.g., Banerjee,
Dolado, Galbraith & Hendry (1993), Wickens & Breusch (1988)). Another possible
choice in this context is the fully modified estimator of Phillips & Hansen
(1990) which takes care of the short-run dynamics nonparametrically and a
semiparametric variant of this estimator proposed by Inder (1993). In
addition, Caporale & Pittis (2004) presented a large number of modifications.
Some of these estimators have rather undesirable small sample properties
compared to the systems ML estimator presented in Section 7.2.3. Even those
modifications that lead to small sample improvements were only shown to work
in a rather limited framework. Also, of course, some of these estimators are
designed only for situations where just one cointegration relation exists.


Two-Stage Estimation

Generally, if a superconsistent estimator $\hat\beta$ of the cointegration
matrix $\beta$ is available, this estimator may be substituted for the true
$\beta$ and all the other parameters may be estimated in a second stage from

$$\Delta y_t = \alpha\hat\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t^*, \tag{7.2.26}$$

where deterministic terms are again ignored for simplicity. If no restrictions
are imposed on $\alpha$ and the $\Gamma_i$'s ($i = 1, \dots, p-1$), LS
estimation can be used without loss of asymptotic efficiency. Denoting the
two-stage estimators of $\alpha$ and $\Gamma$ by $\hat\alpha_{2s}$ and
$\hat\Gamma_{2s}$, respectively, we have

$$\hat\alpha_{2s} = \Delta Y M Y_{-1}' \hat\beta \left(\hat\beta' Y_{-1} M Y_{-1}' \hat\beta\right)^{-1} \tag{7.2.27}$$

and

$$\hat\Gamma_{2s} = (\Delta Y - \hat\alpha_{2s}\hat\beta' Y_{-1}) \Delta X' (\Delta X \Delta X')^{-1}, \tag{7.2.28}$$

where  the  notation  from  the  previous  subsections  has  been  used.  For  these 
estimators  the  following  proposition  holds,  which  is  stated  without  proof. 

Proposition 7.5 (Asymptotic Properties of the Two-Stage LS Estimator)
Let $y_t$ be a $K$-dimensional, cointegrated process with VECM representation
(7.2.1). Then the two-stage estimator is consistent and

$$\sqrt{T}\operatorname{vec}\left([\hat\alpha_{2s} : \hat\Gamma_{2s}] - [\alpha : \Gamma]\right) \xrightarrow{d} N(0, \Sigma_{\alpha,\Gamma}), \tag{7.2.29}$$

where $\Sigma_{\alpha,\Gamma}$ is the same covariance matrix as in (7.2.8). ■

The proposition implies that if a superconsistent estimator of the
cointegration matrix $\beta$ is available, the loading coefficients and
short-run parameters of the VECM can be estimated by LS and these estimators
have the same asymptotic properties we would obtain by using the true $\beta$.
Thus, standard inference procedures can be used for the short-run parameters.
An analogous result is also available for VECMs with parameter restrictions
(see Section 7.3 for the extension).
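The second stage can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the book's implementation: the function names, the simulated bivariate VECM, and the use of the true $\beta$ in place of a superconsistent first-stage estimate are all hypothetical. The code regresses $\Delta Y$ jointly on $[\hat\beta' Y_{-1}; \Delta X]$, which by the usual partitioned regression argument gives the same $\hat\alpha_{2s}$ and $\hat\Gamma_{2s}$ as (7.2.27) and (7.2.28).

```python
import numpy as np

def two_stage_ls(y, beta, p):
    """Second-stage LS for alpha and Gamma_1, ..., Gamma_{p-1} given beta (K x r).

    y: (N x K) array of levels; the VECM contains p-1 lagged differences.
    Returns alpha (K x r) and Gamma (K x K(p-1)).
    """
    dy = np.diff(y, axis=0)              # row i holds Delta y at time i+1
    n = dy.shape[0]
    r = beta.shape[1]
    dY = dy[p - 1:].T                    # Delta Y, K x T
    Ylag = y[p - 1:-1].T                 # Y_{-1}, K x T
    if p > 1:
        # Delta X stacks Delta y_{t-1}, ..., Delta y_{t-p+1}
        dX = np.vstack([dy[p - 1 - j:n - j].T for j in range(1, p)])
    else:
        dX = np.empty((0, n))
    W = np.vstack([beta.T @ Ylag, dX])   # regressors [beta'Y_{-1}; Delta X]
    coef = dY @ W.T @ np.linalg.inv(W @ W.T)
    return coef[:, :r], coef[:, r:]

def simulate_vecm(T, alpha, beta, Gamma1, seed=0):
    """Hypothetical DGP: Delta y_t = alpha beta' y_{t-1} + Gamma1 Delta y_{t-1} + u_t."""
    rng = np.random.default_rng(seed)
    K = alpha.shape[0]
    y = np.zeros((T + 2, K))
    for t in range(2, T + 2):
        dy_prev = y[t - 1] - y[t - 2]
        dy = alpha @ (beta.T @ y[t - 1]) + Gamma1 @ dy_prev + rng.standard_normal(K)
        y[t] = y[t - 1] + dy
    return y

alpha = np.array([[-0.2], [0.1]])
beta = np.array([[1.0], [-1.0]])
Gamma1 = 0.2 * np.eye(2)
y = simulate_vecm(5000, alpha, beta, Gamma1, seed=1)
alpha_hat, Gamma_hat = two_stage_ls(y, beta, p=2)
print(np.round(alpha_hat, 3))
print(np.round(Gamma_hat, 3))
```

Estimating only the first row of `coef` corresponds to the single equation variant mentioned in the next paragraph.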

The  second  stage  in  the  procedure  may  be  modified.  For  instance,  one 
may  just  be  interested  in  the  first  equation  of  the  system.  In  this  case,  the 
first  equation  may  be  estimated  separately  without  taking  into  account  the 
remaining  ones.  Thus,  the  two-stage  procedure  may  be  applied  in  a  single 
equation  modelling  context. 

Results similar to those in Proposition 7.5 were derived by many authors
(see, e.g., Stock (1987), Phillips & Durlauf (1986), Park & Phillips (1989),
and Johansen (1991)). Generally there has been a considerable amount of
research on estimation and hypothesis testing in systems with integrated and
cointegrated variables. For instance, Johansen (1991), Johansen & Juselius
(1990), and Lütkepohl & Reimers (1992b) considered estimation with
restrictions on the cointegration and loading matrices; Park & Phillips
(1988, 1989) and Phillips (1988) provided general results on estimating
systems with integrated and cointegrated exogenous variables; Stock (1987)
considered a so-called nonlinear LS estimator, and Phillips & Hansen (1990)
discussed instrumental variables estimation of models containing integrated
variables.

7.2.6  An  Example 

As an example, we use the bivariate system of quarterly, seasonally unadjusted
German long-term interest rate ($R_t = y_{1t}$) and inflation rate
($Dp_t = y_{2t}$) which was also analyzed in Lütkepohl (2004). The sample
period is the second quarter of 1972 to the end of 1998. Thus we have
$T = 107$ observations. The data are available in File E6 and the two time
series are plotted in Figure 7.1. Preliminary tests indicated that both series
have a unit root and there are also theoretical reasons for a cointegration
relation between them. The so-called Fisher effect implies that the real
interest rate is stationary. Because $R_t$ is a nominal yearly interest rate
while $Dp_t$ is a quarterly inflation rate, one would therefore expect
$R_t - 4Dp_t$ to be stationary, that is, this relation is expected to be a
cointegration relation.


Fig. 7.1. Seasonally unadjusted, quarterly German interest rate $R$ (left) and
inflation rate $Dp$ (right), 1972.2-1998.4.


We have fitted a VECM with a constant, seasonal dummy variables, and
three lagged differences and the pre-specified cointegration relation
$R_t - 4Dp_t$ to the data. The results are shown in Table 7.1. Notice that
three lagged differences in the VECM imply a model with four lags in the
levels. Including at least lags of one year seems plausible because the
inflation series has a strong seasonal pattern (see Figure 7.1). Formal
statistical procedures for determining the lag length will be discussed in the
next chapter. The seasonal movement in $Dp_t$ is also the reason for including
seasonal dummy variables in addition to a constant. The deterministic term,
$D_t = (1, s_{1t}, s_{2t}, s_{3t})'$, where the $s_{it}$ are seasonal dummy
variables, is placed outside the cointegration relation. We have also
estimated a VECM with cointegrating rank $r = 1$ using the reduced rank ML
procedure and the EGLS method. The estimates are also shown in Table 7.1.
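The statement that three lagged differences imply four lags in the levels follows from the standard mapping between the VECM and its levels VAR form, $A_1 = I_K + \alpha\beta' + \Gamma_1$, $A_i = \Gamma_i - \Gamma_{i-1}$ for $i = 2, \dots, p-1$, and $A_p = -\Gamma_{p-1}$. The following Python sketch applies this mapping to the point estimates for the model with known $\beta$ from Table 7.1. The function name `vecm_to_var` is hypothetical, and deterministic terms are ignored here.

```python
import numpy as np

def vecm_to_var(alpha, beta, Gammas):
    """Levels VAR matrices A_1, ..., A_p implied by a VECM with p-1 lagged differences."""
    K = alpha.shape[0]
    Pi = alpha @ beta.T
    p = len(Gammas) + 1
    A = []
    for i in range(1, p + 1):
        if i == 1:
            Ai = np.eye(K) + Pi + (Gammas[0] if p > 1 else 0)
        elif i < p:
            Ai = Gammas[i - 1] - Gammas[i - 2]
        else:
            Ai = -Gammas[p - 2]
        A.append(Ai)
    return A

# Point estimates for the model with known beta, taken from Table 7.1
alpha = np.array([[-0.10], [0.16]])
beta = np.array([[1.0], [-4.0]])
G1 = np.array([[0.27, -0.21], [0.07, -0.34]])
G2 = np.array([[-0.02, -0.22], [-0.00, -0.39]])
G3 = np.array([[0.22, -0.11], [0.02, -0.35]])

A = vecm_to_var(alpha, beta, [G1, G2, G3])
print(len(A))   # number of lags in the levels representation
for i, Ai in enumerate(A, start=1):
    print(f"A_{i} =\n{np.round(Ai, 2)}")
```

A useful check of the mapping is that $A_1 + \cdots + A_p - I_K$ reproduces $\alpha\beta'$, which has reduced rank $r = 1$ in this example.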

Table 7.1. Estimated VECMs for interest rate/inflation example system

parameter    element    known $\beta$     ML estimator      EGLS estimator
$\alpha$     (1,1)      -0.10  (-2.3)     -0.10  (-2.3)     -0.14  (-2.8)
             (2,1)       0.16   (3.8)      0.16   (3.8)      0.14   (2.9)
$\beta'$                [1 : -4]          [1.00 : -3.96]    [1.00 : -3.63]
                                                  (-6.3)            (-6.0)
$\Gamma_1$   (1,1)       0.27   (2.7)      0.27   (2.7)      0.29   (2.9)
             (1,2)      -0.21  (-1.4)     -0.21  (-1.4)     -0.16  (-1.1)
             (2,1)       0.07   (0.7)      0.07   (0.7)      0.08   (0.8)
             (2,2)      -0.34  (-2.4)     -0.34  (-2.4)     -0.31  (-2.2)
$\Gamma_2$   (1,1)      -0.02  (-0.2)     -0.02  (-0.2)      0.01   (0.1)
             (1,2)      -0.22  (-1.8)     -0.22  (-1.8)     -0.19  (-1.6)
             (2,1)      -0.00  (-0.0)     -0.00  (-0.0)      0.01   (0.1)
             (2,2)      -0.39  (-3.4)     -0.39  (-3.4)     -0.37  (-3.2)
$\Gamma_3$   (1,1)       0.22   (2.3)      0.22   (2.3)      0.26   (2.6)
             (1,2)      -0.11  (-1.3)     -0.11  (-1.3)     -0.09  (-1.1)
             (2,1)       0.02   (0.2)      0.02   (0.2)      0.04   (0.4)
             (2,2)      -0.35  (-4.5)     -0.35  (-4.5)     -0.34  (-4.4)
$C'$         (1,1)       0.001  (0.4)      0.002  (0.4)      0.005  (1.2)
             (1,2)       0.010  (3.0)      0.010  (3.0)      0.012  (3.1)
             (2,1)       0.001  (0.3)      0.001  (0.3)      0.001  (0.3)
             (2,2)      -0.034 (-7.5)     -0.034 (-7.5)     -0.034 (-7.5)
             (3,1)       0.009  (1.8)      0.009  (1.8)      0.009  (1.8)
             (3,2)      -0.018 (-3.8)     -0.018 (-3.8)     -0.018 (-3.8)
             (4,1)      -0.000 (-0.1)     -0.000 (-0.1)     -0.000 (-0.1)
             (4,2)      -0.016 (-3.6)     -0.016 (-3.6)     -0.016 (-3.6)

Note: t-values in parentheses next to parameter estimates; element (i,j) is
the entry in row i and column j of the matrix named in the first column;
deterministic terms: constant and seasonal dummies
($D_t = (1, s_{1t}, s_{2t}, s_{3t})'$).

The two estimated cointegration relations are

$$R_t - \underset{(0.63)}{3.96}\, Dp_t = ec_t^{ML} \tag{7.2.30}$$

and

$$R_t - \underset{(0.61)}{3.63}\, Dp_t = ec_t^{EGLS}, \tag{7.2.31}$$

where estimated standard errors are given in parentheses. The first coefficient
is normalized to be 1. Thereby the $t$-ratios and the standard errors of the
inflation coefficient can be interpreted in the usual way. Clearly, $-4$ is
well within a two-standard error interval around both estimates. Therefore one
could argue that restricting the inflation coefficient to $-4$ is in line with
the data. Using the result in Proposition 7.2, a formal test of the null
hypothesis $H_0: \beta_2 = -4$, where $\beta_2$ denotes the second component of
$\beta$, can be based on the $t$-statistic

$$\frac{-3.96 - (-4)}{0.63} = 0.06$$

for the ML estimator or on

$$\frac{-3.63 - (-4)}{0.61} = 0.61$$

for the EGLS estimator. Both $t$-values are small compared to critical values
from the standard normal distribution corresponding to usual significance
levels. Hence, the null hypothesis cannot be rejected for either of the two
estimators.
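The two test statistics are simple enough to recompute directly. The short Python sketch below does so from the reported estimates and standard errors; the helper name `t_stat` and the choice of the two-sided 5% critical value 1.96 are assumptions made for illustration.

```python
def t_stat(estimate, hypothesized, std_error):
    """t-statistic for H0: parameter = hypothesized value."""
    return (estimate - hypothesized) / std_error

t_ml = t_stat(-3.96, -4.0, 0.63)     # ML estimate and standard error
t_egls = t_stat(-3.63, -4.0, 0.61)   # EGLS estimate and standard error

crit = 1.96  # two-sided 5% critical value of the standard normal distribution
for name, t in (("ML", t_ml), ("EGLS", t_egls)):
    decision = "reject" if abs(t) > crit else "do not reject"
    print(f"{name}: t = {t:.2f}, {decision} H0 at the 5% level")
```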

Comparing the other estimates of the three models in Table 7.1, it is
obvious that corresponding estimates do not differ much, especially when the
sampling uncertainty reflected in the $t$-ratios is taken into account. In
particular, the ML estimates are very close to those of the model with fixed
cointegration vector. Thus, imposing the theoretically expected cointegration
vector does not appear to be a problematic constraint.

Another observation that can be made in Table 7.1 is that there are some
insignificant coefficients in the short-run matrices $\Gamma_i$ and the
estimated deterministic terms ($C$). Because some of the parameters in
$\Gamma_3$ have rather large $t$-ratios, it is clear that simply reducing the
lag order is not likely to be a good strategy for reducing the number of
parameters in the model. It makes sense, however, to consider restricting some
of the parameter values to zero. This issue is discussed in the next section.


7.3  Estimating  VECMs  with  Parameter  Restrictions 

As  for  other  models,  restrictions  may  be  imposed  on  the  parameters  of  VECMs 
to  increase  the  estimation  precision.  We  will  first  discuss  restrictions  for  the 
cointegration  relations  and  then  turn  to  restrictions  on  the  loading  coefficients 
and  short-run  parameters. 

7.3.1  Linear  Restrictions  for  the  Cointegration  Matrix 

In case just-identifying restrictions for the cointegration relations are
available, estimation may proceed as described in Section 7.2 and then the
identified estimator of $\beta$ may be obtained by a suitable transformation
of the estimator $\hat\beta$. For example, if $\beta$ is just a single vector
and ML estimation is used, a normalization of the first component may be
obtained by dividing the vector $\hat\beta$ by its first component, as
discussed earlier.

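Sometimes over-identifying restrictions are available for the cointegration
matrix. In general, if the restrictions can be expressed in the form

$$\operatorname{vec}(\beta_{(K-r)}') = \mathcal{R}\eta + \mathbf{r}, \tag{7.3.1}$$

where $\mathcal{R}$ is a fixed $(r(K-r) \times m)$ matrix of rank $m$,
$\mathbf{r}$ is a fixed $r(K-r)$-dimensional vector, and $\eta$ is a vector of
free parameters, the EGLS estimator is still available. The GLS estimator may
be obtained from the vectorized "concentrated model" (7.2.14),

$$\begin{aligned}
\operatorname{vec}(R_0 - \alpha R_1^{(1)}) &= (R_1^{(2)\prime} \otimes \alpha)\operatorname{vec}(\beta_{(K-r)}') + \operatorname{vec}(U^*) \\
&= (R_1^{(2)\prime} \otimes \alpha)(\mathcal{R}\eta + \mathbf{r}) + \operatorname{vec}(U^*),
\end{aligned}$$

so that

$$\operatorname{vec}(R_0 - \alpha R_1^{(1)}) - (R_1^{(2)\prime} \otimes \alpha)\mathbf{r} = (R_1^{(2)\prime} \otimes \alpha)\mathcal{R}\eta + \operatorname{vec}(U^*). \tag{7.3.2}$$

Thus, the GLS estimator for $\eta$ is

$$\hat\eta = \left[\mathcal{R}'\left(R_1^{(2)}R_1^{(2)\prime} \otimes \alpha'\Sigma_u^{-1}\alpha\right)\mathcal{R}\right]^{-1}
\mathcal{R}'\left(R_1^{(2)} \otimes \alpha'\Sigma_u^{-1}\right)
\left[\operatorname{vec}(R_0 - \alpha R_1^{(1)}) - (R_1^{(2)\prime} \otimes \alpha)\mathbf{r}\right].$$

Substituting consistent estimators $\hat\alpha$ and $\hat\Sigma_u$ for $\alpha$
and $\Sigma_u$, respectively, gives the EGLS estimator

$$\hat{\hat{\eta}} = \left[\mathcal{R}'\left(R_1^{(2)}R_1^{(2)\prime} \otimes \hat\alpha'\hat\Sigma_u^{-1}\hat\alpha\right)\mathcal{R}\right]^{-1}
\mathcal{R}'\left(R_1^{(2)} \otimes \hat\alpha'\hat\Sigma_u^{-1}\right)
\left[\operatorname{vec}(R_0 - \hat\alpha R_1^{(1)}) - (R_1^{(2)\prime} \otimes \hat\alpha)\mathbf{r}\right]. \tag{7.3.3}$$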

Extending  the  arguments  used  for  proving  Proposition  7.2,  the  following 
asymptotic  properties  of  the  EGLS  estimator  can  be  shown. 

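Proposition 7.6 (Asymptotic Properties of the Restricted EGLS Estimator)
Suppose $y_t$ is generated by the VECM (7.2.1) and $\beta$ satisfies the
restrictions in (7.3.1). Then

$$\left[\mathcal{R}'\left(R_1^{(2)}R_1^{(2)\prime} \otimes \hat\alpha'\hat\Sigma_u^{-1}\hat\alpha\right)\mathcal{R}\right]^{1/2}(\hat{\hat{\eta}} - \eta) \xrightarrow{d} N(0, I_m). \tag{7.3.4}$$

Thus, standard inference procedures can be based on the transformed
estimator. It can also be shown that $\hat{\hat{\eta}} - \eta = O_p(T^{-1})$.
In other words, the estimator is superconsistent. Clearly, consistent
estimators of $\alpha$ and $\Sigma_u$ are readily available from unrestricted
LS estimation as in Section 7.2.2.

Defining $\hat{\hat{\beta}}{}^{R}_{(K-r)}$ such that
$\operatorname{vec}(\hat{\hat{\beta}}{}^{R\prime}_{(K-r)}) = \mathcal{R}\hat{\hat{\eta}} + \mathbf{r}$,

$$\hat{\hat{\beta}}{}^{R} := \begin{bmatrix} I_r \\ \hat{\hat{\beta}}{}^{R}_{(K-r)} \end{bmatrix}$$

is a restricted estimator of the cointegration matrix. It can, for example, be
used in the two-stage procedure described in Section 7.2.5.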

If the restrictions for the cointegration matrix can be written in the form
$\beta = H\varphi$, where $H$ is some known, fixed $(K \times s)$ matrix and
$\varphi$ is $(s \times r)$ with $s \ge r$, ML estimation is also
straightforward. For example, in a system with three variables and one
cointegration relation, if $\beta_{31} = -\beta_{21}$, we have

$$\beta = \begin{bmatrix} \beta_{11} \\ \beta_{21} \\ -\beta_{21} \end{bmatrix} = H\varphi,$$

where $\varphi := (\beta_{11}, \beta_{21})'$ and $H$ is defined in the obvious
way. If the restrictions can be represented in this form, $Y_{-1}$ is simply
replaced by $H'Y_{-1}$ in the quantities entering the eigenvalue problem in
Proposition 7.3. Denoting the resulting estimator by $\hat\varphi$ gives a
restricted estimator $\hat\beta = H\hat\varphi$ for $\beta$ and corresponding
estimators of $\alpha$ and $\Gamma$ as in Proposition 7.3. If the restrictions
in (7.3.1) can be written in this form, the EGLS and the ML estimators have
again identical asymptotic properties.
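For the three-variable example, the matrix $H$ and the reason why replacing $Y_{-1}$ by $H'Y_{-1}$ suffices can be made explicit in a few lines. The Python sketch below spells out one choice of $H$ consistent with $\beta_{31} = -\beta_{21}$; the numerical values of $\varphi$ and of the $(K \times T)$ matrix `Y_lag` are hypothetical and serve only to check the identity $\beta'Y_{-1} = \varphi'(H'Y_{-1})$.

```python
import numpy as np

# H maps the free parameters (beta_11, beta_21)' into
# beta = (beta_11, beta_21, -beta_21)'
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, -1.0]])

phi = np.array([[1.0], [0.5]])   # hypothetical free parameters, s x r
beta = H @ phi                   # restricted cointegration vector, K x r

# Hypothetical K x T matrix of lagged levels
Y_lag = np.arange(12.0).reshape(3, 4)

# The error correction term depends on the data only through H'Y_{-1},
# an s x T matrix, so the reduced rank problem can be solved in terms of phi
transformed = H.T @ Y_lag
print(beta.ravel())
print(np.allclose(beta.T @ Y_lag, phi.T @ transformed))
```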

However, the restrictions in (7.3.1) can in general not be written in the
form $\beta = H\varphi$. For instance, if there are three variables ($K = 3$)
and two cointegrating relations ($r = 2$), a single zero restriction on the
second cointegration vector cannot be expressed in the form
$\beta = H\varphi$, whereas it may still be written in the form (7.3.1).
Moreover, it may be expressed in the form
$\beta = [H_1\varphi_1, H_2\varphi_2]$ with suitable matrices $H_1$ and $H_2$
and vectors $\varphi_1$ and $\varphi_2$. For example, if a zero restriction is
placed on the last element of the second cointegrating vector, we get

$$\beta = \begin{bmatrix} \beta_{11} & \beta_{12} \\ \beta_{21} & \beta_{22} \\ \beta_{31} & 0 \end{bmatrix} = [H_1\varphi_1, H_2\varphi_2]$$

with $H_1 := I_3$, $\varphi_1 := (\beta_{11}, \beta_{21}, \beta_{31})'$,

$$H_2 := \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$$

and $\varphi_2 := (\beta_{12}, \beta_{22})'$. In that case, restricted ML
estimation is still not difficult but requires an iterative optimization (see
Boswijk & Doornik (2002)).

7.3.2  Linear  Restrictions  for  the  Short-Run  and  Loading 
Parameters 

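If a superconsistent estimator of the cointegration matrix $\beta$ is
available, the two-stage procedure described in Section 7.2.5 can be used for
estimating the loading and short-run parameters of a VECM. The method can be
readily extended to models with parameter restrictions. Suppose there are
linear restrictions of the form

$$\operatorname{vec}[\alpha : \Gamma] = \mathcal{R}\varphi, \tag{7.3.5}$$

where $\mathcal{R}$ is a fixed $(K(r + K(p-1)) \times n)$ matrix and $\varphi$
is an $n$-dimensional vector. Then we can write the model in matrix form as

$$\Delta Y = [\alpha : \Gamma] \begin{bmatrix} \hat\beta' Y_{-1} \\ \Delta X \end{bmatrix} + U^*$$

and in vectorized form we get

$$\begin{aligned}
\operatorname{vec}(\Delta Y) &= \left([Y_{-1}'\hat\beta : \Delta X'] \otimes I_K\right)\operatorname{vec}[\alpha : \Gamma] + \operatorname{vec}(U^*) \\
&= \left([Y_{-1}'\hat\beta : \Delta X'] \otimes I_K\right)\mathcal{R}\varphi + \operatorname{vec}(U^*).
\end{aligned}$$

Hence, the GLS estimator of $\varphi$ is

$$\hat\varphi = \left[\mathcal{R}'\left(\begin{bmatrix} \hat\beta' Y_{-1}Y_{-1}'\hat\beta & \hat\beta' Y_{-1}\Delta X' \\ \Delta X Y_{-1}'\hat\beta & \Delta X\Delta X' \end{bmatrix} \otimes \Sigma_u^{-1}\right)\mathcal{R}\right]^{-1}
\mathcal{R}'\left(\begin{bmatrix} \hat\beta' Y_{-1} \\ \Delta X \end{bmatrix} \otimes \Sigma_u^{-1}\right)\operatorname{vec}(\Delta Y), \tag{7.3.6}$$

from which an EGLS estimator $\hat{\hat{\varphi}}$ is obtained by replacing the
residual covariance matrix $\Sigma_u$ by a consistent estimator. The latter
estimator may, for example, be obtained from an unrestricted estimation of the
model. The resulting EGLS estimator has the following asymptotic properties.

Proposition 7.7 (Asymptotic Properties of the Restricted EGLS Estimator
of the Short-Run Parameters)
Suppose $y_t$ is generated by the VECM (7.2.1), $\hat\beta$ is a
superconsistent estimator of $\beta$, $\hat\Sigma_u$ is a consistent estimator
of $\Sigma_u$, and the short-run and loading parameters satisfy (7.3.5). Then

$$\sqrt{T}(\hat{\hat{\varphi}} - \varphi) \xrightarrow{d} N\left(0,\ \operatorname{plim}\, T\left[\mathcal{R}'\left(\begin{bmatrix} \beta' Y_{-1}Y_{-1}'\beta & \beta' Y_{-1}\Delta X' \\ \Delta X Y_{-1}'\beta & \Delta X\Delta X' \end{bmatrix} \otimes \Sigma_u^{-1}\right)\mathcal{R}\right]^{-1}\right). \tag{7.3.7}$$

We do not prove the proposition but just note that it follows from the
fact that only stationary variables are involved if $\hat\beta$ is replaced by
the true cointegration matrix $\beta$ and the resulting estimator for
$\varphi$ differs from $\hat{\hat{\varphi}}$ by a quantity which is
$o_p(T^{-1/2})$. Moreover, the asymptotic normal distribution of
$\operatorname{vec}[\hat{\hat{\alpha}} : \hat{\hat{\Gamma}}] = \mathcal{R}\hat{\hat{\varphi}}$
follows in the usual way.

It is straightforward to extend these results to the case where the
restrictions are of the form

$$\operatorname{vec}[\alpha : \Gamma] = \mathcal{R}\varphi + \mathbf{r}, \tag{7.3.8}$$

where $\mathbf{r}$ is now a fixed $(K(r + K(p-1)) \times 1)$ vector (see
Problem 7.6). The more special restrictions in (7.3.5) are considered here for
convenience and because they cover most cases of practical importance.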
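The EGLS estimator based on (7.3.6) is easy to code once the regressor matrix $[\hat\beta'Y_{-1}; \Delta X]$ has been formed. The following Python sketch is an illustration under stated assumptions rather than a reference implementation: the function name `restricted_egls`, the use of unrestricted LS residuals for $\hat\Sigma_u$, and the artificial stationary regressors `W` that stand in for $[\hat\beta'Y_{-1}; \Delta X]$ are all hypothetical. Zero restrictions are encoded by deleting the corresponding columns of an identity matrix to obtain $\mathcal{R}$.

```python
import numpy as np

def restricted_egls(dY, W, R):
    """EGLS estimate of phi in vec[alpha : Gamma] = R phi.

    dY: K x T matrix Delta Y
    W:  m x T regressor matrix standing in for [beta_hat'Y_{-1}; Delta X]
    R:  (K*m) x n restriction matrix
    Returns phi and the implied K x m coefficient matrix [alpha : Gamma].
    """
    K, T = dY.shape
    # Unrestricted LS residuals give a consistent estimator of Sigma_u
    B_ls = dY @ W.T @ np.linalg.inv(W @ W.T)
    U = dY - B_ls @ W
    Sigma_inv = np.linalg.inv(U @ U.T / T)
    vec_dY = dY.reshape(-1, order="F")         # column-stacking vec operator
    lhs = R.T @ np.kron(W @ W.T, Sigma_inv) @ R
    rhs = R.T @ np.kron(W, Sigma_inv) @ vec_dY
    phi = np.linalg.solve(lhs, rhs)
    B = (R @ phi).reshape(K, -1, order="F")    # back to [alpha : Gamma]
    return phi, B

# Hypothetical example with K = 2 equations and m = 3 regressors
rng = np.random.default_rng(1)
K, m, T = 2, 3, 400
W = rng.standard_normal((m, T))
B_true = np.array([[-0.1, 0.3, 0.0],
                   [0.2, 0.0, -0.4]])
dY = B_true @ W + 0.1 * rng.standard_normal((K, T))

# Zero restrictions: keep only the columns of I that belong to free coefficients
free = [i for i, b in enumerate(B_true.reshape(-1, order="F")) if b != 0]
R = np.eye(K * m)[:, free]

phi_hat, B_hat = restricted_egls(dY, W, R)
print(np.round(B_hat, 3))
```

Forming the Kronecker products explicitly keeps the code close to (7.3.6); for long samples one would rather compute $\operatorname{vec}(\hat\Sigma_u^{-1}\Delta Y W')$ directly to avoid the large intermediate matrix.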
7.3.3 An Example

In Section 7.2.6, we have seen that in the short-run dynamics of the German
interest rate/inflation example models a number of coefficients have quite low
$t$-ratios (see Table 7.1). Therefore it makes sense to restrict some of the
coefficients to zero. The following model from Lütkepohl (2004, Equation
(3.41)) for our data set is an example of a restricted (subset) VECM:


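$$\begin{aligned}
\begin{bmatrix} \Delta R_t \\ \Delta Dp_t \end{bmatrix}
={}& \begin{bmatrix} \underset{(-3.1)}{-0.07} \\ \underset{(4.5)}{0.17} \end{bmatrix} (R_{t-1} - 4Dp_{t-1})
+ \begin{bmatrix} \underset{(2.5)}{0.24} & \underset{(-1.9)}{-0.08} \\ 0 & \underset{(-2.5)}{-0.31} \end{bmatrix}
\begin{bmatrix} \Delta R_{t-1} \\ \Delta Dp_{t-1} \end{bmatrix}
+ \begin{bmatrix} 0 & \underset{(-2.5)}{-0.13} \\ 0 & \underset{(-3.6)}{-0.37} \end{bmatrix}
\begin{bmatrix} \Delta R_{t-2} \\ \Delta Dp_{t-2} \end{bmatrix} \\
&+ \begin{bmatrix} \underset{(2.1)}{0.20} & \underset{(-1.6)}{-0.06} \\ 0 & \underset{(-4.7)}{-0.34} \end{bmatrix}
\begin{bmatrix} \Delta R_{t-3} \\ \Delta Dp_{t-3} \end{bmatrix}
+ \begin{bmatrix} 0 & 0 & \underset{(2.8)}{0.010} & 0 \\ \underset{(3.0)}{0.010} & \underset{(-7.6)}{-0.034} & \underset{(-3.8)}{-0.018} & \underset{(-3.6)}{-0.016} \end{bmatrix}
\begin{bmatrix} c \\ s_{1,t} \\ s_{2,t} \\ s_{3,t} \end{bmatrix}
+ \begin{bmatrix} \hat u_{1,t} \\ \hat u_{2,t} \end{bmatrix},
\end{aligned} \tag{7.3.9}$$

$$\hat\Sigma_u = \begin{bmatrix} 2.61 & -0.15 \\ -0.15 & 2.31 \end{bmatrix} \times 10^{-5}.$$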


Here we have used the fixed cointegration vector that was found in Section
7.2.6, and the loading coefficients and short-term parameters are estimated by
EGLS. The $t$-ratios are again given in parentheses underneath the parameter
estimates. They are all relatively large. In fact, with two exceptions they
are all larger than two in absolute value. Recall that $t$-ratios can be
interpreted in the usual way as asymptotically standard normally distributed
by Proposition 7.7. Comparing the model (7.3.9) to those in Table 7.1, it
turns out that the parameters with very small $t$-ratios in the unrestricted
models are just the ones restricted to zero in (7.3.9). The model was actually
found by a sequential model selection procedure which will be discussed in the
next chapter.


7.4  Bayesian  Estimation  of  Integrated  Systems 

It is also possible to place Bayesian restrictions on VECMs. A very important
constraint in these models is the cointegrating rank, however. In Bayesian
analysis, a basic idea is to allow the data to revise the prior restrictions
imposed by the analyst. Using this principle also for the unit roots and,
hence, for the cointegration relations, setting up the system in VECM form may
not be the most plausible approach anymore. Therefore, Bayesian restrictions
have often been imposed on the levels VAR form, even if the variables are
possibly integrated. A popular prior in this context is the Minnesota or
Litterman prior which ignores possible cointegration between the variables
altogether. We will present this prior in the following after the general
setting has been discussed.

7.4.1  The  Model  Setup 

In Chapter 5, Section 5.4, we have discussed Bayesian estimation of stationary,
stable VAR($p$) processes. For a Gaussian process with integrated variables
and a normal prior, the posterior distribution of the VAR coefficients can be
derived in a similar manner. We now consider a levels VAR($p$) model of the
form

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t.$$

As usual, $\beta := \operatorname{vec}[\nu, A_1, \dots, A_p]$ is the vector of
VAR coefficients including an intercept vector and we assume a prior

$$\beta \sim N(\beta^*, V_\beta). \tag{7.4.1}$$

Then, using the same line of reasoning as in Section 5.4, the posterior mean
is

$$\bar\beta = \left[V_\beta^{-1} + (ZZ' \otimes \Sigma_u^{-1})\right]^{-1}\left[V_\beta^{-1}\beta^* + (Z \otimes \Sigma_u^{-1})\mathbf{y}\right]$$

and the posterior covariance matrix is

$$\bar\Sigma_\beta = \left[V_\beta^{-1} + (ZZ' \otimes \Sigma_u^{-1})\right]^{-1},$$

where

$$\mathbf{y} := \operatorname{vec}[y_1, \dots, y_T] \quad\text{and}\quad Z := [Z_0, \dots, Z_{T-1}] \quad\text{with}\quad Z_t := \begin{bmatrix} 1 \\ y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}.$$
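The posterior mean and covariance can be computed directly from these formulas. The Python sketch below is a minimal illustration under stated assumptions: the function name `posterior`, the treatment of $\Sigma_u$ as known, and the simulated random walk data are hypothetical and not taken from the text. The prior enters through its inverse covariance matrix so that a flat prior can be represented by a zero matrix.

```python
import numpy as np

def posterior(y, p, beta_star, V_beta_inv, Sigma_u):
    """Posterior mean and covariance of beta = vec[nu, A_1, ..., A_p].

    y: (T+p) x K array of levels; the first p rows are presample values.
    V_beta_inv: inverse of the prior covariance matrix V_beta.
    """
    n, K = y.shape
    T = n - p
    Y = y[p:].T                                    # K x T, columns y_1, ..., y_T
    # Column t of Z is Z_{t-1} = (1, y_{t-1}', ..., y_{t-p}')'
    Z = np.vstack([np.ones((1, T))] +
                  [y[p - j:n - j].T for j in range(1, p + 1)])
    Sigma_inv = np.linalg.inv(Sigma_u)
    precision = V_beta_inv + np.kron(Z @ Z.T, Sigma_inv)
    y_vec = Y.reshape(-1, order="F")               # vec[y_1, ..., y_T]
    rhs = V_beta_inv @ beta_star + np.kron(Z, Sigma_inv) @ y_vec
    cov = np.linalg.inv(precision)
    return cov @ rhs, cov

# Hypothetical data: two independent random walks, VAR order p = 1
rng = np.random.default_rng(0)
K, p = 2, 1
y = np.cumsum(rng.standard_normal((201, K)), axis=0)
d = K * (K * p + 1)

# Prior mean: zero intercepts and A_1 = I_K (each variable a random walk)
beta_star = np.concatenate([np.zeros(K), np.eye(K).reshape(-1, order="F")])

b_flat, _ = posterior(y, p, beta_star, np.zeros((d, d)), np.eye(K))
b_tight, _ = posterior(y, p, beta_star, 1e12 * np.eye(d), np.eye(K))
print(np.round(b_flat, 3))    # coincides with the LS estimator
print(np.round(b_tight, 3))   # essentially the prior mean
```

The two limiting cases bracket what the prior does: with zero prior precision the data alone determine the estimate, and with very large prior precision the prior mean is returned almost unchanged.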


7.4.2  The  Minnesota  or  Litterman  Prior 

A possible choice of $\beta^*$ and $V_\beta$ for stable processes was discussed
in Section 5.4.3. If the variables are believed to be integrated, the
following prior discussed by Doan et al. (1984) and Litterman (1986),
sometimes known as Minnesota prior, could be used: (1) Set the prior mean of
the first lag of each variable equal to one in its own equation and set all
other coefficients at zero. In other words, if the prior means were the true
parameter values, each variable would be a random walk. (2) Choose the prior
variances of the coefficients as in Section 5.4.3. In other words, the prior
variances of the intercept terms are infinite and the prior variance of
$\alpha_{ij,l}$, the $ij$-th element of $A_l$, is

$$v_{ij,l} = \begin{cases} (\lambda/l)^2 & \text{if } i = j, \\ (\lambda\theta\sigma_i/(l\sigma_j))^2 & \text{if } i \ne j, \end{cases}$$

where $\lambda$ is the prior standard deviation of $\alpha_{ii,1}$,
$0 < \theta < 1$, and $\sigma_i^2$ is the $i$-th diagonal element of
$\Sigma_u$. Thus, we get, for instance, for a bivariate VAR(2) system,

$$\begin{aligned}
y_{1t} &= \underset{(\infty)}{0} + \underset{(\lambda)}{1}\cdot y_{1,t-1} + \underset{(\lambda\theta\sigma_1/\sigma_2)}{0}\cdot y_{2,t-1} + \underset{(\lambda/2)}{0}\cdot y_{1,t-2} + \underset{(\lambda\theta\sigma_1/(2\sigma_2))}{0}\cdot y_{2,t-2} + u_{1t}, \\
y_{2t} &= \underset{(\infty)}{0} + \underset{(\lambda\theta\sigma_2/\sigma_1)}{0}\cdot y_{1,t-1} + \underset{(\lambda)}{1}\cdot y_{2,t-1} + \underset{(\lambda\theta\sigma_2/(2\sigma_1))}{0}\cdot y_{1,t-2} + \underset{(\lambda/2)}{0}\cdot y_{2,t-2} + u_{2t},
\end{aligned}$$

where all coefficients are set to their prior means and the numbers in
parentheses are their prior standard deviations. Forgetting about the latter
numbers for the moment, each of these two equations is seen to specify a
random walk for one of the variables. The nonzero prior standard deviations
indicate that we are not sure about such a simple model. The standard
deviations decline with increasing lag length because more recent lags are
assumed to be more likely to belong in the model. The infinite standard
deviations for the intercept terms simply reflect that we do not have any
prior guess for these coefficients. Also, we do not impose covariance priors
and, hence, choose $V_\beta$ to be a diagonal matrix. Its inverse is

$$V_\beta^{-1} = \operatorname{diag}\left(0,\ 0,\ \frac{1}{\lambda^2},\ \frac{\sigma_1^2}{(\lambda\theta\sigma_2)^2},\ \frac{\sigma_2^2}{(\lambda\theta\sigma_1)^2},\ \frac{1}{\lambda^2},\ \frac{2^2}{\lambda^2},\ \frac{(2\sigma_1)^2}{(\lambda\theta\sigma_2)^2},\ \frac{(2\sigma_2)^2}{(\lambda\theta\sigma_1)^2},\ \frac{2^2}{\lambda^2}\right),$$

where 0 is also substituted for the inverse (infinite) standard deviation of the
intercept terms.

To compute $\bar\beta$ requires the inversion of
$V_\beta^{-1} + (ZZ' \otimes \Sigma_u^{-1})$. Because this matrix is usually
quite large, in the past, Bayesian estimation has often been performed
separately for each of the $K$ equations of the system. In that case,

$$\bar b_k = \left[V_k^{-1} + \sigma_k^{-2} ZZ'\right]^{-1}\left(V_k^{-1} b_k^* + \sigma_k^{-2} Z y_{(k)}\right)$$

is used as an estimator for the parameters $b_k$ of the $k$-th equation, that
is, $b_k'$ is the $k$-th row of $B := [\nu, A_1, \dots, A_p]$. Here $V_k$ is
the prior covariance matrix of $b_k$, $b_k^*$ is its prior mean, and
$y_{(k)} := (y_{k1}, \dots, y_{kT})'$. As in Chapter 5, $\sigma_k^2$ is
replaced by the $k$-th diagonal element of the ML estimator

$$\tilde\Sigma_u = Y(I_T - Z'(ZZ')^{-1}Z)Y'/T$$

of the white noise covariance matrix.
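The prior standard deviations and the equation-by-equation estimator can be put together in a short Python sketch. The function names, the 0-based equation index `k`, and the simulated random walk data are assumptions made for illustration; only the formulas for the prior variances and for $\bar b_k$ are taken from the text. The intercept receives zero prior precision, which corresponds to its infinite prior variance.

```python
import numpy as np

def minnesota_prior_sd(K, p, lam, theta, sigma):
    """Prior standard deviations for A_1, ..., A_p (list of K x K arrays)."""
    sds = []
    for l in range(1, p + 1):
        S = np.empty((K, K))
        for i in range(K):
            for j in range(K):
                if i == j:
                    S[i, j] = lam / l
                else:
                    S[i, j] = lam * theta * sigma[i] / (l * sigma[j])
        sds.append(S)
    return sds

def equation_posterior_mean(k, Y, Z, sds, sigma_k2):
    """Estimator b_bar_k for equation k (0-based), ordered as (nu_k, row k of A_1, ...)."""
    b_star = np.zeros(Z.shape[0])
    b_star[1 + k] = 1.0                          # own first lag has prior mean one
    sd_k = np.concatenate([S[k] for S in sds])   # row k of each lag's sd matrix
    # Zero prior precision for the intercept (infinite prior variance)
    V_inv = np.diag(np.concatenate([[0.0], 1.0 / sd_k**2]))
    lhs = V_inv + Z @ Z.T / sigma_k2
    rhs = V_inv @ b_star + Z @ Y[k] / sigma_k2
    return np.linalg.solve(lhs, rhs)

# Hypothetical data: two independent random walks, VAR order p = 2
rng = np.random.default_rng(0)
p, K = 2, 2
y = np.cumsum(rng.standard_normal((202, K)), axis=0)
n = y.shape[0]
Y = y[p:].T
Z = np.vstack([np.ones((1, n - p))] + [y[p - j:n - j].T for j in range(1, p + 1)])
T = Y.shape[1]
B_ls = Y @ Z.T @ np.linalg.inv(Z @ Z.T)
resid = Y - B_ls @ Z
Sigma_u = resid @ resid.T / T                    # ML estimator of Sigma_u
sigma = np.sqrt(np.diag(Sigma_u))

for lam in (1.0, 0.01):
    sds = minnesota_prior_sd(K, p, lam, 0.25, sigma)
    b0 = equation_posterior_mean(0, Y, Z, sds, Sigma_u[0, 0])
    print(lam, np.round(b0, 3))
```

Lowering `lam` pulls the own first lag towards one and the remaining slope coefficients towards zero, while the intercept remains unrestricted.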

Clearly, in this prior, possible cointegration between the variables is not
taken into account. Given the growing importance of the concept of
cointegration in the recent literature, it is perhaps not surprising that the
Minnesota prior has lately lost some of its appeal. Bayesians have responded
to the success of the concept of cointegration and of VECMs in classical
econometrics. Some recent contributions to Bayesian analysis of VECMs include
Kleibergen & van Dijk (1994), Kleibergen & Paap (2002), Strachan (2003), and
Strachan & Inder (2004). A survey with many more references was given by Koop,
Strachan, van Dijk & Villani (2005).


7.4.3  An  Example 

As an example illustrating Bayesian estimation based on the Minnesota prior,
we consider the following four-dimensional system of U.S. economic variables:

$y_1$ - logarithm of the real money stock M1 (ln M1),

$y_2$ - logarithm of GNP in billions of 1982 dollars (ln GNP),

$y_3$ - discount interest rate on new issues of 91-day Treasury bills (rs),

$y_4$ - yield on long-term (20 years) Treasury bonds (rl).

Quarterly data for the years 1954 to 1987 are used. The data are available in
File E3. They are plotted in Figure 7.2. The GNP and M1 data are seasonally
adjusted. The variables rs and rl are regarded as short- and long-term interest
rates, respectively. The plots in Figure 7.2 show that the series are trending.
Thus, they may be integrated and, given that this is a small monetary system,
there may in fact be cointegration. For example, there may be a long-run
money demand relation and perhaps the interest rate spread rl - rs may
be a stationary variable. Although the system may be cointegrated, we will
consider the Minnesota prior in the following.

Fig. 7.2. U.S. ln M1, ln GNP, and interest rate time series.

We have first fitted an unrestricted VAR(2) model to the data and present
the results in Table 7.2. It can be seen that at least the last three of the
four diagonal elements of $A_1$ are estimated to be close to 1. The first
diagonal element is also not drastically different from 1, although 1 is not
within a two-standard error interval around the estimate. On the basis of the
unrestricted estimates, a prior with mean 1 for the diagonal elements of $A_1$
does not appear to be unreasonable for this example. Of course, in a Bayesian
analysis the prior is usually not chosen on the basis of an unrestricted
estimation.


Table 7.2. VAR(2) coefficient estimates for the U.S. example system with estimated
standard errors in parentheses

estimation
method             $\nu$                 $A_1$                                  $A_2$
unrestricted LS    .028     1.307    .106   -.554   -.814      -.318   -.101    .318   1.022
                           (.070)  (.075)  (.107)  (.224)     (.070)  (.076)  (.115)  (.221)
                   .129      .080   1.045   -.177    .473      -.135   -.014   -.197   -.416
                           (.083)  (.088)  (.126)  (.265)     (.083)  (.090)  (.136)  (.261)
                   .096      .193    .068    .978    .284      -.248   -.035    .053   -.644
                           (.077)  (.081)  (.116)  (.245)     (.077)  (.083)  (.125)  (.240)
                   .030      .042    .042    .034   1.065      -.064   -.027    .070   -.308
                           (.038)  (.041)  (.058)  (.122)     (.038)  (.041)  (.063)  (.120)

ML (r = ·)         .041     1.332    .098   -.556   -.838      -.346   -.091    .354    .969
                           (.067)  (.073)  (.104)  (.216)     (.064)  (.073)  (.110)  (.207)
                   .086      .071   1.052   -.169    .549      -.099   -.039   -.239   -.286
                           (.079)  (.086)  (.123)  (.256)     (.076)  (.087)  (.131)  (.245)
                   .005      .179    .080    .991    .425      -.181   -.079   -.022   -.405
                           (.076)  (.082)  (.118)  (.245)     (.073)  (.083)  (.125)  (.235)
                  -.014      .037    .047    .041   1.138      -.032   -.050    .033   -.186
                           (.038)  (.041)  (.059)  (.122)     (.036)  (.042)  (.062)  (.117)

We have estimated the system with the Minnesota prior and different
values of $\lambda$ and $\theta$. Some results for a VAR(2) process are given in Table 7.3
to illustrate the effect of the choice of the prior variance parameters $\lambda$ and $\theta$.
For this particular data set, the combination $\lambda = 1$ and $\theta = .25$ leads to only mild
changes in the estimates relative to the unrestricted estimates ($\lambda = \infty$, $\theta = 1$).
Decreasing $\theta$ has the effect of shrinking the off-diagonal elements towards
zero. Thus, a small $\theta$ is reasonable if the variables are expected to be unrelated.
The effect of a small $\theta$ is seen in Table 7.3 in the panel corresponding to $\lambda = 1$
and $\theta = .01$. On the other hand, lowering $\lambda$ shrinks the diagonal elements of
$A_1$ towards 1 and all other coefficients (except the intercept terms) towards
zero. This effect is clearly observed for $\lambda = .01$, $\theta = .25$. Hence, if the analyst
has a strong prior in favor of unrelated random walks, a small $\lambda$ is appropriate.
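
The roles of the two tightness parameters can be made concrete with a small sketch. It assumes the usual Minnesota (Litterman-type) form of the prior standard deviations, which is not restated in this section: $\lambda/l$ for the own-lag coefficient at lag $l$ and $\lambda\theta\sigma_i/(l\sigma_j)$ for the coefficient of variable $j$ in equation $i$. The function name and the residual scale values `sigma` are hypothetical.

```python
def minnesota_prior_sd(lam, theta, sigma, p):
    """Prior standard deviations of the elements a_{ij,l} of A_l, l = 1..p.

    Assumed form (standard Minnesota prior, not quoted from this section):
      own lags   (i == j): lam / l
      cross lags (i != j): lam * theta * sigma[i] / (l * sigma[j])
    """
    K = len(sigma)
    sd = []
    for l in range(1, p + 1):
        A_l = [[lam / l if i == j else lam * theta * sigma[i] / (l * sigma[j])
                for j in range(K)]
               for i in range(K)]
        sd.append(A_l)
    return sd


def minnesota_prior_mean(K, p):
    """Prior means: identity matrix for A_1 and zero matrices for A_2, ..., A_p."""
    first = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    rest = [[[0.0] * K for _ in range(K)] for _ in range(p - 1)]
    return [first] + rest


# Hypothetical residual scales for a bivariate system with two lags.
sd = minnesota_prior_sd(lam=1.0, theta=0.25, sigma=[1.0, 2.0], p=2)
```

Under these assumptions, a smaller `theta` tightens only the cross-variable entries, whereas a smaller `lam` tightens every entry around the random walk prior mean, which matches the pattern described for Table 7.3.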

Table 7.3. Bayesian estimates of the U.S. example system

prior                  ν       A_1                                   A_2

λ = ∞                .028    1.307    .106   -.554   -.814     -.318   -.101    .318   1.022
θ = 1                .129     .080   1.045   -.177    .473     -.135   -.014   -.197   -.416
(unrestricted)       .096     .193    .068    .978    .284     -.248   -.035    .053   -.644
                     .030     .042    .042    .034   1.065     -.064   -.027    .070   -.308

λ = 1                .061    1.307    .021   -.514   -.465     -.331   -.009    .212    .679
θ = .25              .110     .060   1.088   -.173    .283     -.108   -.060   -.162   -.238
                     .078     .119    .064   1.060    .025     -.167   -.034   -.069   -.316
                     .029     .021    .029    .050   1.044     -.043   -.014    .031   -.265

λ = 1                .083    1.550    .004   -.012   -.007     -.570   -.002   -.000    .004
θ = .01             -.015     .005   1.270   -.011   -.011     -.001   -.271   -.003   -.002
                    -.032    -.003    .008   1.095   -.001     -.002    .002   -.216   -.001
                    -.016    -.003    .004    .002   1.187     -.001    .001    .000   -.252

λ = .01             -.045    1.009    .002   -.001   -.000     -.003    .000   -.000    .000
θ = .25              .018     .001    .999   -.001   -.002      .000   -.002   -.000   -.000
                    -.004     .001   -.000    .993   -.001      .000   -.000   -.002   -.000
                    -.003     .000    .000    .000    .994      .000    .000   -.000   -.002

In practice, one would usually choose a higher VAR order than 2 in a
Bayesian analysis because chopping off the process at $p = 2$ implies a very
strong prior with mean zero and variances zero for $A_3, A_4, \ldots$, which is a
bit unrealistic. The above analysis is just meant to illustrate the effect of
the parameters that determine the prior variances. Also, if the variables are
believed to be cointegrated, the Minnesota prior is not a good choice. It is
more suited for a process which has a VAR representation in first differences
because the basic idea underlying this prior is that the variables are roughly
unrelated random walks. Notice, however, that for the present system, if a
VECM with cointegration rank $r = 1$ and one lagged difference is fitted by ML
and the corresponding levels VAR coefficients are determined via (7.2.10), the
estimates in the lower part of Table 7.2 are obtained. If the system is actually
cointegrated, the rank restriction should not lead to major distortions in the
estimates. Therefore, it should not be surprising that the diagonal elements of
the ML estimator of $A_1$ are again not far from 1. Thus, even if the variables
are cointegrated, the Minnesota prior may not lead to substantial distortions.
This property may explain why the prior has been used successfully in many
applications, in particular, for forecasting (see Litterman (1986)).


7.5  Forecasting  Estimated  Integrated  and  Cointegrated 
Systems 

As seen in Chapter 6, Section 6.5, forecasting integrated and cointegrated
variables is conveniently discussed in the framework of the levels VAR
representation of the data generation process. Therefore we consider a VAR($p$)
model,

$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t$,    (7.5.1)

with integrated and possibly cointegrated variables. All symbols have their
usual meanings (see Section 6.5). Deterministic terms are left out for
convenience. Adding them is a straightforward exercise which is left to the reader.

Replacing the coefficients $A_1, \ldots, A_p$ and the white noise covariance
matrix $\Sigma_u$ by estimators in the forecasting formulas of Section 6.5 creates
problems similar to those in the stationary, stable case considered in Chapter 3,
Section 3.5. Denoting the $h$-step forecast based on estimated coefficients by
$\hat{y}_t(h)$ and indicating estimators by hats gives

$\hat{y}_t(h) = \hat{A}_1 \hat{y}_t(h-1) + \cdots + \hat{A}_p \hat{y}_t(h-p)$,    (7.5.2)

where $\hat{y}_t(j) := y_{t+j}$ for $j \le 0$. For this predictor, the forecast error becomes

$y_{t+h} - \hat{y}_t(h) = [y_{t+h} - y_t(h)] + [y_t(h) - \hat{y}_t(h)]$
$\qquad = \sum_{i=0}^{h-1} \Phi_i u_{t+h-i} + [y_t(h) - \hat{y}_t(h)]$,    (7.5.3)

where the last equality sign follows from Eq. (6.5.4) in Chapter 6. The last two
terms in (7.5.3) are uncorrelated if parameter estimation is based on data up
to period $t$ only. In fact, under standard assumptions, the last term has zero
probability limit, $y_t(h) - \hat{y}_t(h) = o_p(1)$, as in the stationary case (see Problem
7.7). Thus, the forecast errors from estimated processes and processes with
known coefficients are asymptotically equivalent. However, in the present case,
the MSE correction for estimated processes derived in Section 3.5 is difficult
to justify (see Problem 7.8 and Basu & Sen Roy (1987)). This problem must
be kept in mind when forecast intervals are constructed. One possible MSE
estimator is

$\hat{\Sigma}_y(h) = \sum_{i=0}^{h-1} \hat{\Phi}_i \hat{\Sigma}_u \hat{\Phi}_i'$,    (7.5.4)

where the $\hat{\Phi}_i$'s are obtained from the estimated $\hat{A}_i$'s by the recursions in (6.5.5)
in Section 6.5. This estimator is likely to underestimate the true forecast
uncertainty on average in small samples. Therefore, there is some danger that
the confidence level of corresponding forecast intervals is overstated. Reimers
(1991) derived a small sample correction especially for models with
cointegrated variables, and Engle & Yoo (1987) and Reinsel & Ahn (1992) reported
on simulation studies in which imposing the cointegration restriction in the
estimation gave better long-range forecasts than the use of unrestricted
multivariate LS estimators.
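
The forecast recursion (7.5.2) and the MSE estimator (7.5.4) can be sketched in a few lines of plain Python. The sketch assumes that the recursion referred to as (6.5.5) has the usual form $\Phi_0 = I_K$, $\Phi_i = \sum_{j=1}^{\min(i,p)} \Phi_{i-j} A_j$; the function names and the list-of-lists matrix representation are illustrative choices, not part of the text.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def forecast(A_hat, history, h):
    """Forecasts for horizons 1..h from (7.5.2); history holds K-vectors, oldest first."""
    p = len(A_hat)
    path = [list(y) for y in history[-p:]]          # needs at least p observations
    K = len(path[0])
    out = []
    for _ in range(h):
        y_next = [0.0] * K
        for j in range(1, p + 1):
            lagged = path[-j]                       # forecast or observation at lag j
            for r in range(K):
                y_next[r] += sum(A_hat[j - 1][r][c] * lagged[c] for c in range(K))
        path.append(y_next)
        out.append(y_next)
    return out


def phi_matrices(A_hat, h):
    """Phi_0, ..., Phi_{h-1} from the assumed recursion Phi_i = sum_j Phi_{i-j} A_j."""
    K, p = len(A_hat[0]), len(A_hat)
    Phi = [[[1.0 if r == c else 0.0 for c in range(K)] for r in range(K)]]
    for i in range(1, h):
        acc = [[0.0] * K for _ in range(K)]
        for j in range(1, min(i, p) + 1):
            acc = matadd(acc, matmul(Phi[i - j], A_hat[j - 1]))
        Phi.append(acc)
    return Phi


def mse_estimate(A_hat, Sigma_u_hat, h):
    """Estimator (7.5.4): sum over i < h of Phi_i Sigma_u Phi_i'."""
    K = len(Sigma_u_hat)
    total = [[0.0] * K for _ in range(K)]
    for Phi_i in phi_matrices(A_hat, h):
        total = matadd(total, matmul(matmul(Phi_i, Sigma_u_hat), transpose(Phi_i)))
    return total
```

For two unrelated random walks ($A_1 = I_2$) every forecast equals the last observation and the estimated MSE matrix grows linearly in $h$, which illustrates why forecast intervals for integrated variables widen without bound.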


7.6  Testing  for  Granger-Causality 

7.6.1  The  Noncausality  Restrictions 

In Section 6.6, we have seen that the restrictions characterizing
Granger-noncausality are the same as in the stable case. If the levels VAR($p$)
representation (7.5.1) of the data generation process is considered again and the
vector $y_t$ is partitioned in $M$- and $(K - M)$-dimensional subvectors $z_t$ and $x_t$,

$y_t = \begin{bmatrix} z_t \\ x_t \end{bmatrix}$  and  $A_i = \begin{bmatrix} A_{11,i} & A_{12,i} \\ A_{21,i} & A_{22,i} \end{bmatrix}$,  $i = 1, \ldots, p$,

where the $A_i$ are partitioned in accordance with the partitioning of $y_t$, then
$x_t$ does not Granger-cause $z_t$ if and only if the hypothesis

$H_0 : A_{12,i} = 0$ for $i = 1, \ldots, p$,    (7.6.1)

is true. Hence, we just have to test a set of linear restrictions. A Wald test is a
standard choice for this purpose. In the present case, it may be problematic,
however. We will discuss the potential problem next and then present a
modification that has a limiting $\chi^2$-distribution, as usual, and, hence, resolves the
problem.
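
To see how (7.6.1) becomes a set of linear restrictions $C\alpha = 0$ on $\alpha = \mathrm{vec}[A_1, \ldots, A_p]$, the following sketch builds a selection matrix $C$. It assumes that vec stacks columns, so that element $(r, c)$ of $A_i$ (counting from zero) sits at position $(i-1)K^2 + cK + r$ of $\alpha$; the function name is hypothetical.

```python
def noncausality_restrictions(K, M, p):
    """Rows of C picking the elements of A_{12,i}, i = 1..p, out of alpha.

    z_t consists of the first M components of y_t and x_t of the remaining K - M,
    so A_{12,i} is the block with rows 0..M-1 and columns M..K-1 of A_i.
    """
    n_cols = p * K * K
    rows = []
    for i in range(1, p + 1):
        for c in range(M, K):            # columns attached to lags of x_t
            for r in range(M):           # rows belonging to the z_t equations
                row = [0] * n_cols
                row[(i - 1) * K * K + c * K + r] = 1
                rows.append(row)
    return rows


# Three variables, z_t is the first one, VAR(2): N = p * M * (K - M) = 4 restrictions.
C = noncausality_restrictions(K=3, M=1, p=2)
```

Each row of `C` contains a single one, so `C` has full row rank $N = pM(K - M)$ by construction.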


7.6.2  Problems  Related  to  Standard  Wald  Tests 

If the process is estimated by one of the procedures described in Section
7.2 such that the estimator $\hat{\alpha}$ of $\alpha := \mathrm{vec}[A_1, \ldots, A_p]$ has the asymptotic
distribution given in Corollary 7.1.1, then a Wald test can be conducted for
the pair of hypotheses

$H_0 : C\alpha = 0$  against  $H_1 : C\alpha \ne 0$.    (7.6.2)

Here $C$ is an $(N \times pK^2)$ matrix of rank $N$. The relevant Wald statistic is

$\lambda_W = T \hat{\alpha}' C' (C \hat{\Sigma}_{\hat\alpha}^{co} C')^{-1} C \hat{\alpha}$,    (7.6.3)

where $\hat{\Sigma}_{\hat\alpha}^{co}$ is a consistent estimator of $\Sigma_{\hat\alpha}^{co}$. The statistic $\lambda_W$ has an asymptotic
$\chi^2(N)$-distribution, provided the null hypothesis is true and

$\mathrm{rk}(C \hat{\Sigma}_{\hat\alpha}^{co} C') = \mathrm{rk}(C \Sigma_{\hat\alpha}^{co} C') = N$.    (7.6.4)

This result follows from standard asymptotic theory (see Appendix C.7). We
have chosen to state it here again because the rank condition (7.6.4) now
becomes important. It is automatically satisfied for stable, full VAR processes as
discussed in Chapter 3, because in that case the asymptotic covariance matrix
of the coefficient estimator is nonsingular. Now, however, $\Sigma_{\hat\alpha}^{co}$ is singular if
the cointegration rank $r$ is less than $K$ (see Corollary 7.1.1). Therefore, it is
possible in principle that $\mathrm{rk}(C \Sigma_{\hat\alpha}^{co} C') < N$, even if $C$ has full row rank $N$.

A limiting $\chi^2$-distribution of $\lambda_W$ can also be obtained if the inverse of
$C \hat{\Sigma}_{\hat\alpha}^{co} C'$ in (7.6.3) is replaced by a generalized inverse. In that case, the
asymptotic distribution of $\lambda_W$ is $\chi^2(\mathrm{rk}(C \Sigma_{\hat\alpha}^{co} C'))$ if

$\mathrm{rk}(C \hat{\Sigma}_{\hat\alpha}^{co} C') = \mathrm{rk}(C \Sigma_{\hat\alpha}^{co} C')$    (7.6.5)

with probability one (see Andrews (1987)). Unfortunately, the latter condition
will not hold in general. In particular, if a cointegrated system is estimated in
unconstrained form by multivariate LS and if $\Sigma_{\hat\alpha}^{co}$ is estimated as in Corollary
7.1.1, $C \hat{\Sigma}_{\hat\alpha}^{co} C'$ has rank $N$ with probability 1, while $\mathrm{rk}(C \Sigma_{\hat\alpha}^{co} C')$ may be less
than $N$. Andrews (1987) showed that in such a case the asymptotic
distribution of $\lambda_W$ may not even be $\chi^2$. A detailed analysis of the problem for the
particular case of testing for Granger-causality in cointegrated systems was
provided by Toda & Phillips (1993). In this context, it is perhaps worth
pointing out that the equality in (7.6.5) may not hold, even if the cointegration
rank has been specified correctly and the corresponding restrictions have been
imposed in the estimation procedure (see Problem 7.9). For the hypothesis of
interest here, a possible solution to the problem was proposed by Dolado &
Lütkepohl (1996) and Toda & Yamamoto (1995). It will be presented next.
Our discussion follows the former article.

Another  possible  approach  to  overcome  inference  problems  in  levels  VARs 
with  integrated  variables  was  described  by  Phillips  (1995).  It  is  known  as  fully 
modified  VAR  estimation  and  is  based  on  nonparametric  corrections.  Some  of 
its  drawbacks  are  pointed  out  by  Kauppi  (2004). 

7.6.3  A  Wald  Test  Based  on  a  Lag  Augmented  VAR 

As discussed in Section 7.2 (see in particular Section 7.2.1), the estimators of
coefficients attached to stationary regressors converge at the usual $T^{1/2}$ rate
to a nonsingular normal distribution. Therefore, the problem of the previous
subsection can be solved if the model can be rewritten in such a way that all
parameters under test are attached to stationary regressors. To this end, the
following reparameterization is helpful:

$y_t = \sum_{j=1, j \ne i}^{p} A_j y_{t-j} + A_i y_{t-i} + u_t$
$\quad\; = \sum_{j=1, j \ne i}^{p} A_j (y_{t-j} - y_{t-i}) + \Big( \sum_{j=1}^{p} A_j \Big) y_{t-i} + u_t.$

Defining a differencing operator $\Delta_k$ such that $\Delta_k y_t = y_t - y_{t-k}$ for
$k = \pm 1, \pm 2, \ldots$, the model can be written as

$\Delta_i y_t = \sum_{j=1, j \ne i}^{p} A_j \Delta_{i-j} y_{t-j} + \Pi y_{t-i} + u_t$,    (7.6.6)

where $\Pi = -(I_K - A_1 - \cdots - A_p)$, as usual. For $k > 0$,
$\Delta_k y_t = (y_t - y_{t-1}) + (y_{t-1} - y_{t-2}) + \cdots + (y_{t-k+1} - y_{t-k})$ is stationary as the sum of stationary
processes and the same is easily seen to hold for $k < 0$. Therefore, it follows
from the previously mentioned results in Section 7.2 that the LS estimators
of the $A_j$, $j \ne i$, have a nonsingular joint asymptotic normal distribution.
Notice that these estimators are, of course, identical to those based on the
levels VAR model (7.5.1) because we have just reparameterized the model.
Hence, the following proposition from Dolado & Lütkepohl (1996, Theorem
1) is obtained.

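Proposition 7.8 (Asymptotic Distribution of the Wald Statistic)

Let $y_t$ be a $K$-dimensional $I(1)$ process generated by the VAR($p$) process in
(7.5.1) and denote the LS estimator of $A_i$ by $\hat{A}_i$ ($i = 1, \ldots, p$). Moreover,
let $\alpha_{(-i)}$ be a $K^2(p-1)$-dimensional vector obtained by deleting $A_i$ from
$[A_1, \ldots, A_p]$ and vectorizing the remaining matrix. Analogously, let $\hat{\alpha}_{(-i)}$ be
a $K^2(p-1)$-dimensional vector obtained by deleting $\hat{A}_i$ from $[\hat{A}_1, \ldots, \hat{A}_p]$ and
vectorizing the remainder. Then

$\sqrt{T}(\hat{\alpha}_{(-i)} - \alpha_{(-i)}) \xrightarrow{d} N(0, \Sigma_{\hat{\alpha}_{(-i)}})$,    (7.6.7)

where the $(K^2(p-1) \times K^2(p-1))$ covariance matrix $\Sigma_{\hat{\alpha}_{(-i)}}$ is nonsingular
and the Wald statistic $\lambda_W$ for testing $H_0 : C\alpha_{(-i)} = 0$ has a limiting
$\chi^2(N)$-distribution, that is,

$\lambda_W = T \hat{\alpha}_{(-i)}' C' (C \hat{\Sigma}_{\hat{\alpha}_{(-i)}} C')^{-1} C \hat{\alpha}_{(-i)} \xrightarrow{d} \chi^2(N)$

under $H_0$. Here $C$ is an $(N \times K^2(p-1))$ matrix with $\mathrm{rk}(C) = N$ and $\hat{\Sigma}_{\hat{\alpha}_{(-i)}}$
is a consistent estimator of $\Sigma_{\hat{\alpha}_{(-i)}}$. ■

Note that

$\Sigma_{\hat{\alpha}_{(-i)}} = \mathrm{plim}\; T (X_{(-i)} X_{(-i)}')^{11} \otimes \Sigma_u$,

where $X_{(-i)} := [X_0^{(-i)}, \ldots, X_{T-1}^{(-i)}]$ with

$X_{t-1}^{(-i)} := \begin{bmatrix} \Delta_{i-1} y_{t-1} \\ \vdots \\ \Delta_{i-p} y_{t-p} \\ y_{t-i} \end{bmatrix}$  $(Kp \times 1)$

(the block with $j = i$ is omitted from the list of differences), and
$(X_{(-i)} X_{(-i)}')^{11}$ denotes the upper left-hand $(K(p-1) \times K(p-1))$-dimensional
submatrix of $(X_{(-i)} X_{(-i)}')^{-1}$, so that the Kronecker product with the $(K \times K)$
matrix $\Sigma_u$ has dimension $(K^2(p-1) \times K^2(p-1))$. Thereby a consistent estimator
of $\Sigma_{\hat{\alpha}_{(-i)}}$ is obtained as

$\hat{\Sigma}_{\hat{\alpha}_{(-i)}} = T (X_{(-i)} X_{(-i)}')^{11} \otimes \hat{\Sigma}_u$,

where $\hat{\Sigma}_u$ is the residual covariance matrix obtained from the LS residuals.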

Proposition 7.8 shows that, whenever the elements in at least one of the
complete coefficient matrices $A_i$ are not restricted under $H_0$, the Wald statistic
has its usual asymptotic $\chi^2$-distribution. In other words, if restrictions are
placed on all $A_i$'s, $i = 1, \ldots, p$, as in the noncausality hypothesis (7.6.1), we
can get a $\chi^2$ Wald test by adding an extra lag in estimating the parameters of
the process. If the true data generation process is a VAR($p$), then a VAR($p+1$)
with $A_{p+1} = 0$ is also a correct model. Because we know that $A_{p+1} = 0$, the
causality test can be based on the estimator $\hat{\alpha}_{(-(p+1))}$, that is, an estimator
of the first $K^2 p$ elements of $\mathrm{vec}[A_1, \ldots, A_{p+1}]$. Notice that LS estimation may
be applied to the levels VAR($p+1$) model. To carry out the causality test, it
is not necessary to actually perform the reparameterization of the process in
(7.6.6) because the LS estimators of the $A_i$ matrices do not change due to the
reparameterization. Also, the covariance matrix of the asymptotic distribution
may be estimated as usual from the levels VAR($p+1$).

We do not have to know the cointegration properties of the system to use
this lag augmentation test procedure. Of course, there may be a loss of power
due to over-specifying the lag length. The loss in power may not be substantial
if the true order $p$ is large and the dimension $K$ is small or moderate, because,
in this case, the relative reduction in the estimation precision due to one extra
VAR coefficient matrix may be small. On the other hand, if the true order is
small and $K$ is large, an extra lag of all variables may lead to a sizeable decline
in overall estimation precision and, hence, in the power of the modified Wald
test. There are, in fact, cases where the extra lag is not necessary to obtain
the asymptotic $\chi^2$-distribution of the Wald test for Granger-causality. For
example, for bivariate processes with cointegrating rank 1, no extra lag is
needed if both variables are $I(1)$ (e.g., Lütkepohl & Reimers (1992a)).

Proposition 7.8 remains valid if deterministic terms are included in the
VAR model. This result follows from the discussion in Section 7.2 because
including such terms leaves the asymptotic properties of the VAR coefficients
unaffected. It may also be of interest that a similar result can be obtained
for VAR systems with $I(d)$ variables where $d > 1$. In that case, $d$ coefficient
matrices $A_i$ must be unrestricted under $H_0$ (see Dolado & Lütkepohl (1996)).
Alternatively, $d$ lags must be added if all parameter matrices of the original
process are restricted. This result can also be obtained from Sims et al. (1990).
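
A minimal sketch of the lag augmentation idea for a bivariate system follows. It fits the $z$ equation of a levels VAR($p+1$) with intercept by LS and tests only the coefficients of $x_{t-1}, \ldots, x_{t-p}$. Working with a single equation is a simplification chosen here: because the restrictions involve one equation only and the covariance estimator has Kronecker form, the statistic coincides with the system Wald statistic. The helper names, the simulated data and the plain Gauss-Jordan inversion are illustrative assumptions.

```python
import random


def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lag_augmented_wald(z, x, p):
    """Wald statistic for H0: x does not Granger-cause z, from a levels VAR(p + 1)."""
    q = p + 1                                   # augmented lag order
    rows, target = [], []
    for t in range(q, len(z)):
        reg = [1.0]                             # intercept
        for j in range(1, q + 1):
            reg += [z[t - j], x[t - j]]         # x_{t-j} is stored at index 2 * j
        rows.append(reg)
        target.append(z[t])
    T, k = len(rows), len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * y for r, y in zip(rows, target)) for a in range(k)]
    XtX_inv = invert(XtX)
    b = [sum(XtX_inv[a][c] * Xty[c] for c in range(k)) for a in range(k)]
    resid = [y - sum(bi * ri for bi, ri in zip(b, r)) for r, y in zip(rows, target)]
    sigma2 = sum(e * e for e in resid) / T
    idx = [2 * j for j in range(1, p + 1)]      # lags 1..p only; lag p + 1 stays unrestricted
    V_inv = invert([[sigma2 * XtX_inv[a][c] for c in idx] for a in idx])
    b_r = [b[a] for a in idx]
    stat = sum(b_r[a] * V_inv[a][c] * b_r[c] for a in range(p) for c in range(p))
    return stat, p                              # statistic and chi-square degrees of freedom


def simulate(T, causal, seed=0):
    """Hypothetical data: x is a random walk; z either loads on x_{t-1} or is a random walk."""
    rng = random.Random(seed)
    x, z = [0.0], [0.0]
    for _ in range(T):
        x_prev = x[-1]
        x.append(x_prev + rng.gauss(0.0, 1.0))
        e = rng.gauss(0.0, 1.0)
        z.append(0.5 * x_prev + e if causal else z[-1] + e)
    return z, x
```

The statistic is compared with $\chi^2(p)$ critical values. Dividing it by $p$ gives an F-type version of the kind reported in the example of Section 7.6.4.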

7.6.4  An  Example 

We follow again Lütkepohl (2004) and use the German interest rate/inflation
example to illustrate causality testing for cointegrated variables. The data
generation process is assumed to be a VAR(4). The model is augmented
by one lag and, hence, a VAR(5) is fitted and used in the actual tests for
Granger-causality, while a VAR(4) is used for testing instantaneous causality.
The results are given in Table 7.4, where F-versions of the Granger-causality
test statistics are reported. The asymptotic $\chi^2$-distribution is often a poor
approximation to the small sample distribution of the causality test statistics.
Therefore, an F-version is preferred which is obtained in the usual way by
dividing the $\chi^2$-statistic by its degrees of freedom parameter (see Section 3.6).
As in Section 3.6, the test for instantaneous causality is based on the residual
covariance matrix. This approach is justified by Lemma 7.3 which shows that
the asymptotic distribution of the usual residual covariance matrix estimator
is the same as in the stationary case. Hence, the same test for instantaneous
causality can be used under normality assumptions.


Table 7.4. Tests for causality between German interest rate and inflation

causality hypothesis                test value    distribution    p-value

R Granger-causal for Dp                2.24        F(4, 152)       0.07
Dp Granger-causal for R                0.31        F(4, 152)       0.87
R and Dp instantaneously causal        0.61        χ²(1)           0.44
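
The relation between the F-versions and the underlying $\chi^2$-type statistics, and the p-value of the instantaneous causality test, can be checked with the standard library alone. The sketch uses the identity $P(\chi^2(1) > s) = \mathrm{erfc}(\sqrt{s/2})$; the function names are hypothetical, and the inputs are the rounded values from Table 7.4, so the results agree with the table only up to rounding.

```python
import math


def f_version(chi2_stat, df):
    """F-type statistic: the chi-square-type statistic divided by its degrees of freedom."""
    return chi2_stat / df


def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variate at `stat`."""
    return math.erfc(math.sqrt(stat / 2.0))


# An F value of 2.24 with 4 numerator degrees of freedom corresponds to a Wald value of 8.96.
wald_equivalent = 2.24 * 4

# p-value of the instantaneous causality statistic 0.61 under chi-square(1).
p_instantaneous = chi2_1_pvalue(0.61)
```

P-values for the F(4, 152) entries would require the F distribution function, which is not available in the Python standard library and is therefore left out of this sketch.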

None of the p-values in Table 7.4 is smaller than 0.05. Therefore, none of
the noncausality hypotheses can be rejected at the 5% significance level. Given
the subset model (7.3.9), this outcome is somewhat surprising because there
are clearly significant estimates in that model. Of course, using the present
tests is a different way of looking at the data than considering the individual
coefficients in the subset model. The relatively large number of parameters in
the presently considered unrestricted model, which even includes an extra lag,
makes it difficult for the sample information to clearly distinguish the sets of
parameters from their values specified in the null hypothesis.

The insignificant value of the test for instantaneous causality is not
surprising, however. The correlation matrix corresponding to the covariance matrix
in (7.3.9) is

$\begin{bmatrix} 1 & -0.01 \\ -0.01 & 1 \end{bmatrix}.$

Thus, the instantaneous correlation between the two residual series is very
small. This property is reflected in the test result in Table 7.4.


7.7  Impulse  Response  Analysis 

In Section 6.7, we have seen that, in principle, impulse response analysis in
cointegrated systems can be conducted in the same way as for stationary
systems. If estimated processes are used, the asymptotic properties of the
impulse response coefficients and forecast error variance components follow
from Proposition 3.6 in conjunction with Corollary 7.1.1. In other words, the
relevant covariance matrices $\Sigma_{\hat\alpha}^{co}$ and $\Sigma_{\hat\sigma}$ have to be used in Proposition 3.6. Of
course, the remarks on Proposition 3.6 regarding the estimation of standard
errors etc. apply to the present case too. In practice, confidence intervals for
impulse responses are typically computed with bootstrap methods.

To illustrate the impulse response analysis we use again our German
interest rate/inflation example system. We have performed an impulse response
analysis on the basis of the subset VECM (7.3.9) and show forecast error
impulse responses with bootstrap confidence intervals determined by Hall's
percentile method (see Appendix D.3) in Figure 7.3. Using forecast error
impulse responses is unproblematic here because no instantaneous causality and
no significant instantaneous correlation between the two residual series was
diagnosed in Section 7.6.4. The point estimates of the impulse responses look
very much like those in Figure 6.4 in Chapter 6. This similarity is not
surprising because the model assumed in that chapter is very similar to the present
one. Because the variables are integrated of order one, the impulses have
permanent effects. This conclusion can be defended even if the estimation
uncertainty is taken into account.


Fig. 7.3. Forecast error impulse responses for model (7.3.9) with 95% Hall
percentile bootstrap confidence intervals based on 2000 bootstrap replications.
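
For readers who want to reproduce intervals of this kind, the following sketch computes a Hall percentile interval from a set of bootstrap replications of a single impulse response coefficient. It assumes the usual definition, in which the quantiles of the bootstrap deviations $\hat\phi^* - \hat\phi$ are reflected around the point estimate, giving $[\hat\phi - t^*_{1-\gamma/2},\; \hat\phi - t^*_{\gamma/2}]$. The simple order-statistic quantile rule and the function names are illustrative choices, and generating the bootstrap replications themselves is not shown.

```python
import math


def empirical_quantile(values, q):
    """Order-statistic quantile: the smallest value with empirical cdf >= q."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, math.ceil(q * len(s)) - 1))
    return s[idx]


def hall_percentile_interval(estimate, boot_estimates, gamma=0.05):
    """Hall percentile interval with nominal coverage 1 - gamma (assumed definition)."""
    deviations = [b - estimate for b in boot_estimates]
    lower = estimate - empirical_quantile(deviations, 1.0 - gamma / 2.0)
    upper = estimate - empirical_quantile(deviations, gamma / 2.0)
    return lower, upper


# Tiny hypothetical example with four bootstrap replications and gamma = 0.5.
interval = hall_percentile_interval(1.0, [0.5, 1.0, 1.5, 3.0], gamma=0.5)
```

In an application with 2000 replications and `gamma=0.05`, the function would be applied separately to every response horizon and every variable/shock pair.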


We emphasize again that an uncritical impulse response analysis is
problematic. In particular, different sets of impulse responses exist and it is not
clear which one properly reflects the actual reactions of the variables. The
caveats of impulse response analysis are discussed in Sections 2.3 and 3.7.
They are therefore not repeated here. We will return to impulse response
analysis in Chapter 9, when structural restrictions are discussed for
identifying meaningful shocks.


7.8 Exercises

7.8.1  Algebraic  Exercises 

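Problem 7.1

Show that, in the proof of Result 6 of Section 7.1,

$T^{-1} \sum_{t=1}^{T} (\hat{u}_t - u_t) y_{t-1}' = o_p(1).$

(Hint: Use

$T^{-1} \sum_{t=1}^{T} (\hat{u}_t - u_t) y_{t-1}' = (\alpha - \hat{\alpha}) T^{-1} \sum_{t=1}^{T} \beta' y_{t-1} y_{t-1}'.$)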

Problem 7.2

Prove Proposition 7.1 based on the ideas presented in Section 7.2.1. (Hint:
See Ahn & Reinsel (1990).)

Problem 7.3

Prove that $\sqrt{T}[\hat{\alpha}\hat{\beta}' - \hat{\alpha}(\beta)\beta'] = o_p(1)$ holds in the proof of Lemma 7.3. (Hint:
note that

$\hat{\alpha}\hat{\beta}' - \hat{\alpha}(\beta)\beta' = \hat{\alpha}[\hat{\beta}' - \beta'] + [\hat{\alpha} - \hat{\alpha}(\beta)]\beta'.$)

Problem 7.4

Determine the ML estimators in a cointegrated VAR($p$) process with
cointegration rank $r$, under the assumption that the cointegration matrix satisfies
restrictions $\beta = H\varphi$, where $H$ and $\varphi$ are $(K \times s)$ and $(s \times r)$ matrices,
respectively, with $r < s < K$. (Hint: Proceed as in the proof of Proposition
7.3.)

Problem 7.5

Show that the expressions in (7.2.27) and (7.2.28) are the LS estimators of $\alpha$
and $\Gamma$, respectively, conditional on $\beta$ being replaced by its estimator.

Problem 7.6

Derive the EGLS estimator for restrictions of the form $\mathrm{vec}[\alpha : \Gamma] = \mathcal{R}\gamma + r$
on the short-run parameters of the VECM (7.2.1) and state its asymptotic
distribution (see (7.3.8) for the definition of the notation).

Problem 7.7

Consider a cointegrated VAR(1) process without intercept, $y_t = A_1 y_{t-1} + u_t$,
and show that

$\mathrm{plim}\,[y_T(1) - \hat{y}_T(1)] = \mathrm{plim}\,(A_1 - \hat{A}_1) y_T = 0.$

Assume that $y_t$ is Gaussian with initial vector $y_0 = 0$ and the ML estimator
$\hat{A}_1$ is based on $y_1, \ldots, y_T$. (Hint: Use Lemma 7.2 and $\mathrm{plim}\, y_T/T = 0$ from
Phillips & Durlauf (1986).)

Problem 7.8

Consider the matrix $\Omega(h)$ used in the MSE correction in Section 3.5 and
argue why it is problematic for unstable processes. Analyze in particular the
derivation in (3.5.12).

Problem 7.9

Consider a three-dimensional VAR(1) process with cointegration rank 1 and
suppose the cointegrating matrix has the form $\beta = (\beta_1, \beta_2, 0)'$. Use Corollary
7.1.1 to demonstrate that the elements in the last column of $A_1$ have zero
asymptotic variances. Formulate a linear hypothesis for the coefficients of $A_1$
for which the rank condition (7.6.4) is likely to be violated if the covariance
estimator of Corollary 7.1.1 is used.


7.8.2  Numerical  Exercises 

The  following  problems  are  based  on  the  U.S.  data  given  in  File  E3  and 
described  in  Section  7.4.3.  The  variables  are  defined  as  in  that  subsection. 

Problem  7.10 

Apply the ML procedure described in Section 7.2.3 to estimate a VAR(3)
process with cointegration rank $r = 1$ and intercept vector. Determine the
estimates of $\nu$, $A_1$, $A_2$, and $A_3$ and compare them to unrestricted LS estimates
of a VAR(3) process.

Problem  7.11 

Compute forecasts up to 10 periods ahead using both the unrestricted VAR(3)
model and the VAR(3) model with cointegration rank 1. Compare the
forecasts.

Problem  7.12 

Compare the impulse responses obtained from an unrestricted VAR(3) model
with those from a VAR(3) model restricted to cointegration rank 1.


In specifying VECMs, the lag order, the cointegration rank and possibly
further restrictions have to be determined. The lag order and the cointegration
rank are typically determined before further restrictions are imposed on the
parameter matrices. Moreover, the specification of a VECM usually starts
by determining a suitable lag length because, in choosing the lag order, the
cointegration rank does not have to be known, whereas many procedures for
specifying the cointegration rank require knowledge of the lag order.
Therefore, in the following, we will first discuss the lag order choice (Section 8.1)
and then consider procedures for determining the cointegration rank (Section
8.2). We will comment on subset modelling in a VECM framework in Section
8.3 and, in Section 8.4, we will discuss checking the adequacy of such models.
More precisely, residual autocorrelation analysis, testing for nonnormality and
structural change are dealt with.


8.1  Lag  Order  Selection 

It  was  mentioned  in  Section  7.2.1  that  Wald  tests  for  zero  restrictions  on 
coefficient  matrices  of  the  lagged  differences  can  be  constructed.  Hence,  the 
number  of  lagged  differences  in  a  VECM  can  be  chosen  by  a  sequence  of  tests 
similar  to  that  in  Section  4.2.  Because  the  procedure  and  its  problems  are 
discussed  in  some  detail  in  that  section,  we  will  not  repeat  the  discussion  here 
but  focus  on  order  selection  criteria  such  as  AIC,  HQ,  and  SC  in  this  section. 

In Section 4.3, the FPE criterion was introduced for stationary, stable
processes as a criterion that minimizes the forecast MSE and therefore has a
justification if forecasting is the objective. We have seen in Section 7.5 that the
forecast MSE correction used for estimated stationary processes is difficult to
justify in the cointegrated case and, hence, the FPE criterion cannot be put
on the same footing in the latter case. This argument does not mean, however,
that the criterion is not a useful one in some other sense for nonstationary
processes. For instance, it is possible that it still provides models with excellent
small sample forecasting properties. It was also shown in Section 4.3 that
Akaike's AIC is asymptotically equivalent to the FPE criterion. Therefore,
similar comments apply to AIC.

The  criteria  HQ  and  SC  were  justified  by  their  ability  to  choose  the  order 
“correctly  in  large  samples”,  that  is,  they  are  consistent  criteria.  It  was  shown 
by  Paulsen  (1984)  and  Tsay  (1984)  that  the  consistency  property  of  these 
criteria  is  maintained  for  integrated  processes.  To  make  that  statement  precise, 
we  give  the  following  result  from  Paulsen  (1984)  without  proof. 

Proposition 8.1 (Consistent VAR Order Estimation)

Let

$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t$

be a $K$-dimensional VAR($p$) process with $A_p \ne 0$ and standard white noise $u_t$
and suppose that $\det(I_K - A_1 z - \cdots - A_p z^p)$ has $s$ roots equal to one, that is,
$z = 1$ is a root with multiplicity $s$, and all other roots are outside the complex
unit circle. Furthermore, let

$\mathrm{Cr}(m) = \ln |\tilde{\Sigma}_u(m)| + m c_T / T$,    (8.1.1)

where $\tilde{\Sigma}_u(m)$ is the Gaussian ML or quasi ML estimator of $\Sigma_u$ for a VAR($m$)
model based on a sample of size $T$ and $m$ fixed presample values as in
Proposition 4.2, and $c_T$ is a nondecreasing sequence indexed by $T$. Let $\hat{p}$ be such
that

$\mathrm{Cr}(\hat{p}) = \min\{\mathrm{Cr}(m) \mid m = 0, 1, \ldots, M\}$

and suppose $M \ge p$. Then $\hat{p}$ is a consistent estimator of $p$ if and only if
$c_T \to \infty$ and $c_T / T \to 0$ as $T \to \infty$. ■

This proposition extends Proposition 4.2 to processes with integrated
variables. It implies that AIC is not a consistent criterion while HQ and SC are
both consistent. Thus, if consistent estimation is the objective, we may apply
HQ and SC for stationary and integrated processes.

Denoting the orders chosen by AIC, HQ, and SC by $\hat{p}$(AIC), $\hat{p}$(HQ), and
$\hat{p}$(SC), respectively, we also get from Proposition 4.3 that

$\hat{p}(\mathrm{SC}) \le \hat{p}(\mathrm{HQ}) \le \hat{p}(\mathrm{AIC})$  for $T \ge 16$.

This result is obtained because Proposition 4.3 does not require any
stationarity or stability assumptions. It follows as in Chapter 4 that AIC asymptotically
overestimates the true order with positive probability (see Corollary 4.3.1).
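
The ordering of the selected orders reflects the ordering of the penalty sequences $c_T$ in (8.1.1). The sketch below assumes the usual Chapter 4 definitions, under which $c_T = 2K^2$ for AIC, $c_T = 2K^2 \ln\ln T$ for HQ and $c_T = K^2 \ln T$ for SC, and checks from which sample size onwards the penalties are ordered. The function names are hypothetical.

```python
import math


def penalty_terms(T, K):
    """Penalty sequences c_T in Cr(m) = ln|Sigma_u(m)| + m * c_T / T (assumed definitions)."""
    return {
        "AIC": 2.0 * K * K,
        "HQ": 2.0 * K * K * math.log(math.log(T)),
        "SC": K * K * math.log(T),
    }


def criterion(log_det_sigma, m, T, c_T):
    """Value of the criterion (8.1.1) for order m."""
    return log_det_sigma + m * c_T / T


def penalties_ordered(T, K=2):
    """True if the AIC penalty <= HQ penalty <= SC penalty for this sample size."""
    c = penalty_terms(T, K)
    return c["AIC"] <= c["HQ"] <= c["SC"]


# Smallest sample size (searching upwards from 3) at which the penalties are ordered.
first_T = next(T for T in range(3, 100) if penalties_ordered(T))
```

Because $\ln\ln T \ge 1$ requires $T \ge e^e \approx 15.2$, the ordering starts at $T = 16$, in line with the condition stated above. The AIC penalty does not grow with $T$, which is why the condition $c_T \to \infty$ of Proposition 8.1 fails for AIC.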

Although these results are nice because they generalize the stationary case
in an easy way, they do not mean that AIC or FPE are order selection criteria
inferior to HQ and SC. Recall that consistent order estimation may not be a
relevant objective in small sample situations. In fact, the true data generating
process may not admit a finite order VAR representation.

Notice also that, while we have considered specifying the VAR order $p$,
the criteria are also applicable for choosing the number of lagged differences
in a VECM because $p - 1$ lagged differences in a VECM correspond to a VAR
order $p$. Thus, once we know $p$, we know the number of lagged differences.
If some of the variables are known to be integrated, the VAR order must be
at least 1. This information can be taken into account in model selection by
searching only over orders $1, \ldots, M$ rather than $0, 1, \ldots, M$.

We  have  applied  the  three  criteria  AIC,  HQ,  and  SC  to  our  German  interest 
rate/inflation  example  data  from  Section  7.2.6  with  a  maximum  order  of  M  = 
8  and  a  constant  and  seasonal  dummies  in  the  model.  The  values  of  the  criteria 
are  shown  in  Table  8.1.  SC  and  HQ  both  recommend  the  order  p  =  1  while 
p(AIC)  =  4.  Thus,  in  a  VECM  based  on  SC  and  HQ,  no  lagged  differences 
appear,  whereas  three  lagged  differences  have  to  be  included  according  to  AIC. 
We  have  chosen  to  go  with  the  AIC  estimate  in  the  example  in  Section  7.2.6. 


Table 8.1. VAR order estimation for interest rate/inflation system

VAR order m    AIC(m)     HQ(m)      SC(m)
     0         -18.75     -18.75     -18.75
     1         -20.98     -20.94*    -20.88*
     2         -20.97     -20.89     -20.76
     3         -20.89     -20.77     -20.58
     4         -20.99*    -20.82     -20.57
     5         -20.93     -20.72     -20.41
     6         -20.89     -20.63     -20.26
     7         -20.85     -20.55     -20.12
     8         -20.80     -20.46     -19.96

* Minimum.


In  Chapter  4,  we  have  mentioned  that  model  selection  may  be  based  on 
the  residual  autocorrelations  or  portmanteau  tests.  These  statistics  can  also 
be  used  for  VECMs.  They  are  discussed  in  Section  8.4.1. 
Although model selection criteria have also been used in specifying the cointegrating rank of a VECM (e.g., Lütkepohl & Poskitt (1998)), it is more common in practice to use statistical tests for this purpose. Many different tests have been proposed in the literature and the properties of most of them depend on the deterministic terms included in the model. In the following, we will therefore discuss models with different deterministic terms separately. The general model is assumed to be of the form

$$y_t = \mu_t + x_t,$$

where $x_t$ is the stochastic part which is assumed to have a VECM representation without deterministic terms and $\mu_t$ is the deterministic term, as in Chapter 6, Section 6.4. We will start with the easiest although most unrealistic case where no deterministic term is present and, thus, $\mu_t = 0$. Most of the discussion will focus on likelihood ratio (LR) tests and close relatives of them because they are very common in applied work and they also fit well into the present framework. Some comments on other procedures will be provided in Section 8.2.9.

8.2.1  A  VECM  without  Deterministic  Terms 

Based on Proposition 7.3, it is easy to derive the likelihood ratio statistic for testing a specific cointegration rank $r = r_0$ of a VECM against a larger rank of cointegration, say $r = r_1$. Consider the VECM without deterministic terms,

$$\Delta y_t = \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t, \tag{8.2.1}$$

where $y_t$ is a process of dimension $K$, $\mathrm{rk}(\Pi) = r$ with $0 \leq r \leq K$, the $\Gamma_j$'s ($j = 1, \dots, p-1$) are $(K \times K)$ parameter matrices and $u_t \sim \mathcal{N}(0, \Sigma_u)$ is Gaussian white noise, as in Chapter 7, Section 7.2.3. For simplicity we assume that the process starts at time $t = 1$ with zero initial values (i.e., $y_t = 0$ for $t \leq 0$). Alternatively, the initial values may be any fixed values.

Suppose we wish to test

$$H_0: \mathrm{rk}(\Pi) = r_0 \quad \text{against} \quad H_1: r_0 < \mathrm{rk}(\Pi) \leq r_1. \tag{8.2.2}$$

Under normality assumptions, the maximum of the likelihood function for a model with cointegration rank $r$ is given in Proposition 7.3. From that result, the LR statistic for testing (8.2.2) is seen to be

$$\begin{aligned}
\lambda_{LR}(r_0, r_1) &= 2[\ln l(r_1) - \ln l(r_0)] \\
&= T\left[-\sum_{i=1}^{r_1} \ln(1 - \lambda_i) + \sum_{i=1}^{r_0} \ln(1 - \lambda_i)\right] \\
&= -T \sum_{i=r_0+1}^{r_1} \ln(1 - \lambda_i),
\end{aligned} \tag{8.2.3}$$

where $l(r_i)$ denotes the maximum of the Gaussian likelihood function for cointegration rank $r_i$. Obviously, the test value is quite easy to compute, using the eigenvalues from Proposition 7.3.
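To make the computation concrete, the following Python sketch obtains the eigenvalues and evaluates (8.2.3) for a model without deterministic terms. It is an illustration under assumptions: the eigenvalues are computed from the moment matrices $S_{00}$, $S_{01}$, $S_{11}$ of the residuals obtained after regressing $\Delta y_t$ and $y_{t-1}$ on the lagged differences, which is the usual reduced rank regression form, and the function names and simulated data are hypothetical.

```python
import numpy as np

def johansen_eigenvalues(y, p):
    """Ordered eigenvalues for a VECM with p - 1 lagged differences, no deterministic terms."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y, axis=0)                  # row i holds y_{i+1} - y_i
    T = dy.shape[0] - (p - 1)                # effective sample size
    R0 = dy[p - 1:]                          # Delta y_t
    R1 = y[p - 1:-1]                         # y_{t-1}
    if p > 1:
        # concentrate out Delta y_{t-1}, ..., Delta y_{t-p+1}
        dX = np.hstack([dy[p - 1 - j:p - 1 - j + T] for j in range(1, p)])
        R0 = R0 - dX @ np.linalg.lstsq(dX, R0, rcond=None)[0]
        R1 = R1 - dX @ np.linalg.lstsq(dX, R1, rcond=None)[0]
    S00 = R0.T @ R0 / T
    S01 = R0.T @ R1 / T
    S11 = R1.T @ R1 / T
    # eigenvalues of S11^{-1} S10 S00^{-1} S01
    mat = np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01)
    lam = np.sort(np.linalg.eigvals(mat).real)[::-1]
    return lam, T

def lr_rank_stat(lam, T, r0, r1):
    """LR statistic (8.2.3) for H0: rk(Pi) = r0 against r0 < rk(Pi) <= r1."""
    return -T * np.sum(np.log(1.0 - lam[r0:r1]))

# Hypothetical bivariate system with one cointegration relation
rng = np.random.default_rng(1)
trend = np.cumsum(rng.standard_normal(203))
y = np.column_stack([trend, trend + rng.standard_normal(203)])
lam, T = johansen_eigenvalues(y, p=3)
trace_stat = lr_rank_stat(lam, T, 0, 2)      # r1 = K
max_eig_stat = lr_rank_stat(lam, T, 0, 1)    # r1 = r0 + 1
print(trace_stat, max_eig_stat)
```

The two calls at the end differ only in the upper summation limit $r_1$, which is all that separates the statistic for $r_1 = K$ from the one for $r_1 = r_0 + 1$.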

It turns out, however, that the asymptotic distribution of the LR statistic under the null hypothesis for given $r_0$ and $r_1$ is nonstandard. In particular, it is not a $\chi^2$-distribution. It depends on the number of common trends $K - r_0$ under $H_0$ and on the alternative hypothesis. Two different pairs of hypotheses have received prime attention in the related literature:

$$H_0: \mathrm{rk}(\Pi) = r_0 \quad \text{versus} \quad H_1: r_0 < \mathrm{rk}(\Pi) \leq K \tag{8.2.4}$$

and

$$H_0: \mathrm{rk}(\Pi) = r_0 \quad \text{versus} \quad H_1: \mathrm{rk}(\Pi) = r_0 + 1. \tag{8.2.5}$$

The LR statistic $\lambda_{LR}(r_0, K)$ for checking (8.2.4) is often referred to as the trace statistic for testing the cointegrating rank and $\lambda_{LR}(r_0, r_0 + 1)$ is called the maximum eigenvalue statistic. Johansen (1988, 1995) shows that the asymptotic distributions of these LR statistics under the null hypothesis are

$$\lambda_{LR}(r_0, K) \xrightarrow{d} \mathrm{tr}(\mathcal{D}) \tag{8.2.6}$$

and

$$\lambda_{LR}(r_0, r_0 + 1) \xrightarrow{d} \lambda_{\max}(\mathcal{D}), \tag{8.2.7}$$

where $\lambda_{\max}(\mathcal{D})$ denotes the maximum eigenvalue of the matrix $\mathcal{D}$ and

$$\mathcal{D} := \left(\int_0^1 W\,dW'\right)' \left(\int_0^1 WW'\,ds\right)^{-1} \left(\int_0^1 W\,dW'\right). \tag{8.2.8}$$

Here $W := W_{K-r_0}(s)$ stands for a $(K - r_0)$-dimensional standard Wiener process. In other words, the limiting null distributions are functionals of a $(K - r_0)$-dimensional standard Wiener process. Percentage points of the asymptotic distributions and, thus, critical values for the LR tests can be generated easily. Tables are, for example, available in Johansen (1995). Hence, an LR test is available under Gaussian assumptions and, as usual, the test statistics have the same limiting distributions even if the underlying process is not normally distributed but satisfies the more general assumptions used in Section 7.2, for example.

The strategy for determining the cointegrating rank of a given system of $K$ variables is to test a sequence of null hypotheses,

$$H_0: \mathrm{rk}(\Pi) = 0, \quad H_0: \mathrm{rk}(\Pi) = 1, \quad \dots, \quad H_0: \mathrm{rk}(\Pi) = K - 1, \tag{8.2.9}$$

and terminate the tests when the null hypothesis cannot be rejected for the first time. The cointegrating rank is then chosen accordingly. Both the maximum eigenvalue and the trace tests may be used here. For example, if there are three variables ($K = 3$), we first test $\mathrm{rk}(\Pi) = 0$. If this null hypothesis cannot be rejected, the analysis proceeds with a cointegration rank of $r = 0$ and, hence, a model in first differences is considered in the subsequent analysis. If, however, $\mathrm{rk}(\Pi) = 0$ is rejected, we test $\mathrm{rk}(\Pi) = 1$. Should the test not reject this hypothesis, the analysis may proceed with a VECM with cointegrating rank $r = 1$. Otherwise $\mathrm{rk}(\Pi) = 2$ is tested and $r = 2$ is chosen as the cointegrating rank if this hypothesis cannot be rejected. If $\mathrm{rk}(\Pi) = 2$ is also rejected, one may consider working with a stationary VAR model for the levels of the variables.

Clearly, in these tests the lag order has to be known. In practice, it is often chosen by one of the model selection criteria discussed in the previous section, based on the levels VAR model, before the cointegrating rank is tested.
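The sequential testing strategy amounts to a short loop. The sketch below is only an illustration; the function name `select_rank` is hypothetical, and the numbers in the usage line are the trace test values and 5% critical values reported in Table 8.2 for the model with a constant, seasonal dummies, and three lagged differences.

```python
def select_rank(test_values, critical_values):
    """Return the first r0 for which H0: rk(Pi) = r0 cannot be rejected.

    test_values[r0] and critical_values[r0] belong to the null hypothesis
    rk(Pi) = r0, for r0 = 0, ..., K - 1.
    """
    for r0, (value, crit) in enumerate(zip(test_values, critical_values)):
        if value <= crit:          # no rejection: stop here
            return r0
    return len(test_values)        # every null rejected: full rank K

# Values from Table 8.2 (constant, seasonal dummies, three lagged differences)
rank = select_rank([21.78, 4.77], [19.99, 9.13])
print(rank)
```

If every null hypothesis is rejected, the function returns $K$, which corresponds to working with a stationary VAR model for the levels.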

As  mentioned  previously,  the  model  framework  in  (8.2.1)  is  too  simple  for 
practical  purposes  because  deterministic  terms  are  usually  needed  to  describe 
the  generation  process  of  a  given  set  of  time  series  properly.  Therefore,  we  will 
now  consider  processes  with  deterministic  terms. 


8.2.2  A  Nonzero  Mean  Term 

We now assume that the deterministic term consists of a simple constant mean term only,

$$\mu_t = \mu_0. \tag{8.2.10}$$

Although we typically think of $\mu_0$ as a fixed nonzero $(K \times 1)$ vector, the case $\mu_0 = 0$ is not explicitly excluded. In other words, the user of the test is not sure that the process mean is zero and therefore allows for the possibility of a nonzero mean term. In Section 6.4, we have seen that in this case the VECM for the observable variables $y_t$ can be written as

$$\Delta y_t = \Pi^{o} y^{o}_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t, \tag{8.2.11}$$

where

$$y^{o}_{t-1} := \begin{bmatrix} y_{t-1} \\ 1 \end{bmatrix}$$

and $\Pi^{o} := [\Pi : \nu_0]$ is $(K \times (K+1))$ with $\nu_0 := -\Pi\mu_0$. Thus, the LR statistic for testing the cointegration rank can be determined exactly as in the zero mean case considered in the previous subsection, except that $y_{t-1}$ has to be replaced by $y^{o}_{t-1}$ in the relevant formulas from which the eigenvalues are computed in Proposition 7.3. In this case, the LR statistics have asymptotic null distributions as in (8.2.6) and (8.2.7), where now

$$\mathcal{D} := \left(\int_0^1 W^{o}\,dW'\right)' \left(\int_0^1 W^{o}W^{o\prime}\,ds\right)^{-1} \left(\int_0^1 W^{o}\,dW'\right) \tag{8.2.12}$$

with

$$W^{o} := W^{o}(s) := \begin{bmatrix} W_{K-r_0}(s) \\ 1 \end{bmatrix}$$

(see Johansen (1991)). Again, critical values may be found in Johansen (1995).
8.2.3 A Linear Trend


A process with a linear trend is also of interest from a practical point of view. Hence, let

$$\mu_t = \mu_0 + \mu_1 t, \tag{8.2.13}$$

where $\mu_0$ and $\mu_1$ are arbitrary $(K \times 1)$ vectors. In Section 6.4, we have seen that in this case the VECM for the observable $y_t$ can be represented as

$$\Delta y_t = \nu + \Pi^{+} y^{+}_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t, \tag{8.2.14}$$

where $\nu := -\Pi\mu_0 + (I_K - \Gamma_1 - \cdots - \Gamma_{p-1})\mu_1$, $\Pi^{+} := [\Pi : \nu_1]$ is a $(K \times (K+1))$ matrix with $\nu_1 := -\Pi\mu_1$, and

$$y^{+}_{t-1} := \begin{bmatrix} y_{t-1} \\ t - 1 \end{bmatrix}.$$

Thus, the LR statistics of interest can again be determined exactly as in the zero mean case of Section 8.2.1 by replacing $y_{t-1}$ with $y^{+}_{t-1}$ and accounting for the intercept term by adding a row of ones in $\Delta X$ in the relevant formulas in Proposition 7.3 (see Section 7.2.4). For the present case, the LR statistics have asymptotic null distributions as in (8.2.6) and (8.2.7) with

$$\mathcal{D} := \left(\int_0^1 W^{+}\,dW'\right)' \left(\int_0^1 W^{+}W^{+\prime}\,ds\right)^{-1} \left(\int_0^1 W^{+}\,dW'\right). \tag{8.2.15}$$

Here $W^{+}$ abbreviates the $(K - r_0 + 1)$-dimensional stochastic process $W^{+}(s) := [\bar{W}(s)', s - \tfrac{1}{2}]'$ with $\bar{W}(s) := W_{K-r_0}(s) - \int_0^1 W_{K-r_0}(u)\,du$ being a demeaned standard Wiener process, as shown by Johansen (1994, 1995). Critical values may also be found in the latter reference.
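The representation (8.2.14) can be checked directly. The following short derivation is a sketch that uses only $y_t = \mu_0 + \mu_1 t + x_t$ and the VECM (8.2.1) for $x_t$, so that $\Delta y_t = \mu_1 + \Delta x_t$, $x_{t-1} = y_{t-1} - \mu_0 - \mu_1(t-1)$, and $\Delta x_{t-j} = \Delta y_{t-j} - \mu_1$:

```latex
\begin{align*}
\Delta y_t
  &= \mu_1 + \Pi x_{t-1} + \sum_{j=1}^{p-1} \Gamma_j \Delta x_{t-j} + u_t \\
  &= \mu_1 + \Pi\bigl(y_{t-1} - \mu_0 - \mu_1 (t-1)\bigr)
     + \sum_{j=1}^{p-1} \Gamma_j \bigl(\Delta y_{t-j} - \mu_1\bigr) + u_t \\
  &= \underbrace{-\Pi\mu_0 + (I_K - \Gamma_1 - \cdots - \Gamma_{p-1})\mu_1}_{\nu}
     + \Pi y_{t-1} + \underbrace{(-\Pi\mu_1)}_{\nu_1}\,(t-1)
     + \sum_{j=1}^{p-1} \Gamma_j \Delta y_{t-j} + u_t .
\end{align*}
```

Collecting $\Pi y_{t-1} + \nu_1 (t-1)$ into one term gives the product of $[\Pi : \nu_1]$ and the vector that stacks $y_{t-1}$ on $t-1$. Setting $\mu_1 = 0$ leaves only the constant $-\Pi\mu_0$, which is the form used for the constant mean case.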


8.2.4  A  Linear  Trend  in  the  Variables  and  Not  in  the 
Cointegration  Relations 

In the model (8.2.14), the linear trend term is unrestricted and therefore may also be part of the cointegration relations. Even if the variables have a linear trend, it is possible that there is no such term in the cointegration relations. In other words, the cointegration relations are drifting along a common linear trend. This situation can arise if the trend slope is the same for all variables which have a linear trend. Formally this case occurs if $\mu_1 \neq 0$ and $\Pi\mu_1 = \alpha\beta'\mu_1 = 0$ or, equivalently, if $\beta'\mu_1 = 0$. In other words, this situation is present if the trend parameter $\mu_1$ is nonzero and it is orthogonal to the cointegration relations. In this case, (8.2.14) reduces to

$$\Delta y_t = \nu + \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t. \tag{8.2.16}$$

Thus, in this situation we have a model just like (8.2.1), except that there is an intercept term in addition. Again, the LR statistics for testing (8.2.4) or (8.2.5) can be determined easily as in the zero mean case of Section 8.2.1 by adding a row of ones in $\Delta X$ in the relevant formulas in Proposition 7.3 (see Section 7.2.4). The limiting distributions of the LR statistics under the null hypothesis are also as in (8.2.6) and (8.2.7), where now

$$\mathcal{D} := \left(\int_0^1 \bar{W}\,dW'\right)' \left(\int_0^1 \bar{W}\bar{W}'\,ds\right)^{-1} \left(\int_0^1 \bar{W}\,dW'\right). \tag{8.2.17}$$

Here $\bar{W} := \bar{W}(s) := W^{c}(s) - \int_0^1 W^{c}(u)\,du$, where $W^{c}(s) := [W_{K-r_0-1}(s)', s]'$ is a $(K - r_0)$-dimensional stochastic process. This result and corresponding critical values for the tests may also be found in Johansen (1995).

Notice that the condition $\mu_1 \neq 0$ and $\Pi\mu_1 = 0$ rules out the situation where $\mathrm{rk}(\Pi) = K$ because, for a nonsingular matrix $\Pi$, the relation $\Pi\mu_1 = 0$ cannot hold for a nonzero $\mu_1$. Thus, the assumptions made for deriving the limiting distributions of the test statistics make a test of

$$H_0: \mathrm{rk}(\Pi) = K - 1 \quad \text{versus} \quad H_1: \mathrm{rk}(\Pi) = K$$

meaningless. Intuitively, this result is obtained because, if $\Pi$ has full rank, the data generation process is stationary and, in that case, a VAR process with an intercept does not generate a linear trend. Thus, if a linear trend is known to be present in the variables, $\Pi$ cannot have full rank in a model where an intercept is the only deterministic term.


8.2.5  Summary  of  Results  and  Other  Deterministic  Terms 

The results of the previous subsections are summarized in the following proposition.

Proposition 8.2 (Limiting Distributions of LR Tests for the Cointegrating Rank)

Suppose $y_t = \mu_t + x_t$, where $\mu_t$ is a deterministic term and $x_t$ is a purely stochastic Gaussian process defined by

$$\Delta x_t = \Pi x_{t-1} + \Gamma_1 \Delta x_{t-1} + \cdots + \Gamma_{p-1} \Delta x_{t-p+1} + u_t, \quad t = 1, 2, \dots,$$

where all symbols are defined as in (8.2.1) and $x_t = 0$ for $t \leq 0$. Then the LR statistics for testing (8.2.4) and (8.2.5) have limiting null distributions

$$\lambda_{LR}(r_0, K) \xrightarrow{d} \mathrm{tr}(\mathcal{D})$$

and

$$\lambda_{LR}(r_0, r_0 + 1) \xrightarrow{d} \lambda_{\max}(\mathcal{D}),$$

respectively, where

$$\mathcal{D} := \left(\int_0^1 F\,dW_{K-r_0}'\right)' \left(\int_0^1 FF'\,ds\right)^{-1} \left(\int_0^1 F\,dW_{K-r_0}'\right)$$

with

(1) $F(s) = W_{K-r_0}(s)$, if $\mu_t = 0$ a priori,

(2) $F(s) = W^{o}(s) = [W_{K-r_0}(s)' : 1]'$, if $\mu_t = \mu_0$ is a constant,

(3) $F(s) = [\bar{W}(s)', s - \tfrac{1}{2}]'$ as in (8.2.15), if $\mu_t = \mu_0 + \mu_1 t$ is a linear trend,

(4) $F(s) = \bar{W}(s)$ as in (8.2.17), if $\mu_t = \mu_0 + \mu_1 t$ is a linear trend with $\mu_1 \neq 0$ and $\beta'\mu_1 = 0$, that is, the trend is orthogonal to the cointegration relations.


Several  remarks  are  worthwhile  with  respect  to  this  result. 

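Remark 1 Percentage points of the asymptotic distributions in Proposition 8.2 are easy to simulate by considering multivariate random walks of the form

$$x_t = x_{t-1} + u_t, \quad t = 1, 2, \dots, T,$$

where $x_0 = 0$ and $u_t \sim \mathcal{N}(0, I_K)$ is Gaussian white noise, that is,

$$x_t = \sum_{j=1}^{t} u_j.$$

Noting that

$$T^{-2} \sum_{t=1}^{T} x_{t-1}x_{t-1}' \xrightarrow{d} \int_0^1 WW'\,ds, \qquad T^{-1} \sum_{t=1}^{T} x_{t-1}u_t' \xrightarrow{d} \int_0^1 W\,dW',$$

and so on (see Appendix C.8, Proposition C.18), we can, for example, approximate

$$\mathrm{tr}\left[\left(\int_0^1 W\,dW'\right)' \left(\int_0^1 WW'\,ds\right)^{-1} \left(\int_0^1 W\,dW'\right)\right]$$

by

$$\mathrm{tr}\left[\left(\sum_{t=1}^{T} x_{t-1}u_t'\right)' \left(\sum_{t=1}^{T} x_{t-1}x_{t-1}'\right)^{-1} \left(\sum_{t=1}^{T} x_{t-1}u_t'\right)\right]$$

for a large sample size $T$. Similar approximations can be used for the other asymptotic distributions (see also Problem 8.2). ■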
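The simulation described in Remark 1 can be sketched as follows for the case without deterministic terms. The function name `simulate_trace_quantiles`, the walk length, the number of replications, and the seed are illustrative assumptions rather than choices taken from the text; the argument `dim` is the dimension of the simulated random walk, which corresponds to the number of common trends $K - r_0$ when critical values for a given null hypothesis are wanted.

```python
import numpy as np

def simulate_trace_quantiles(dim, T=400, reps=2000, probs=(0.90, 0.95), seed=0):
    """Approximate quantiles of tr(D) by simulating dim-dimensional random walks."""
    rng = np.random.default_rng(seed)
    stats = np.empty(reps)
    for i in range(reps):
        u = rng.standard_normal((T, dim))                 # u_t ~ N(0, I)
        x = np.cumsum(u, axis=0)                          # x_t = u_1 + ... + u_t
        x_lag = np.vstack([np.zeros((1, dim)), x[:-1]])   # x_{t-1}, with x_0 = 0
        A = x_lag.T @ u                                   # sum of x_{t-1} u_t'
        B = x_lag.T @ x_lag                               # sum of x_{t-1} x_{t-1}'
        stats[i] = np.trace(A.T @ np.linalg.solve(B, A))
    return np.quantile(stats, probs)

q1 = simulate_trace_quantiles(dim=1)    # one common trend
q2 = simulate_trace_quantiles(dim=2)    # two common trends
print(q1, q2)
```

The powers of $T$ cancel in the product, so the sums can be used without scaling. Larger values of `T` and `reps` reduce the approximation and simulation errors, and published tables should be preferred for applied work.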
Remark 2 Although we only give the limiting distributions of the LR statistics under the null hypothesis in the proposition, the asymptotic distributions under local alternatives of the form

$$\Pi = \alpha\beta' + T^{-1}\alpha_1\beta_1'$$

were also derived (see Johansen (1995) and Saikkonen & Lütkepohl (1999, 2000a)). Here $\alpha$ and $\beta$ are fixed $(K \times r_0)$ matrices of rank $r_0$ and $\alpha_1$ and $\beta_1$ are fixed $(K \times (r - r_0))$ matrices of rank $r - r_0$ and such that the matrices $[\alpha : \alpha_1]$ and $[\beta : \beta_1]$ have full column rank $r$. Thus, in this setup, the matrix $\Pi$ is assumed to depend on the sample size. Local power studies have been performed to shed light on the power properties of the LR tests when the alternative is true but the corresponding parameter values are close to the region where the null hypothesis holds. ■

Remark 3 Power comparisons between the alternative test versions can help in deciding whether to use trace or maximum eigenvalue tests. Lütkepohl, Saikkonen & Trenkler (2001) performed a detailed small sample and local power comparison of several test versions and concluded that trace and maximum eigenvalue tests have very similar local power in many situations, whereas each test version has its relative advantages in small samples, depending on the criterion for comparison. Thus, neither of the tests is generally preferable in practice. ■

Remark  4  It  is  also  possible  to  derive  the  asymptotic  properties  of  the  LR 
tests  for  other  deterministic  terms.  For  example,  higher  order  polynomial 
trends  may  be  considered.  Such  terms  lead  to  changes  in  the  null  distribu¬ 
tions  of  the  test  statistics.  We  do  not  consider  them  here  because  they  seem 
to  be  of  lesser  importance  from  a  practical  point  of  view.  ■ 

Remark  5  Seasonal  dummy  variables  are  another  type  of  deterministic  terms 
which  are  of  practical  importance.  They  are  often  used  to  account  for  seasonal 
fluctuations  in  the  variables  (see,  e.g.,  the  example  in  Section  7.2.6).  If  sea¬ 
sonal  dummies  are  added  in  addition  to  an  unrestricted  intercept  term,  they 
do  not  affect  the  asymptotic  distributions  of  the  LR  statistics  for  the  cointe¬ 
gration  rank.  We  have  considered  two  models,  however,  where  no  unrestricted 
intercept  term  was  included.  The  first  one  was  the  model  of  Section  8.2.1 
without  any  deterministic  terms  at  all.  As  this  model  is  of  limited  practical 
use  anyway,  we  do  not  consider  the  implications  of  adding  seasonal  dummy 
variables.  The  other  model  without  an  unrestricted  intercept  term  was  the  one 
with  a  nonzero  mean  discussed  in  Section  8.2.2.  It  is  of  more  use  in  practice 
and  it  is  therefore  of  interest  to  consider  the  possibility  of  adding  seasonal 
dummies. 

Suppose there are $q$ seasons and the deterministic term is of the form

$$\mu_t = \mu_0 + \sum_{i=1}^{q-1} \delta_i s_{it},$$

where $\mu_0$ and $\delta_i$ ($i = 1, \dots, q-1$) are $(K \times 1)$ parameter vectors and the seasonal dummies are denoted by $s_{it}$. Suppose that they are defined such that they are orthogonal to the intercept term, that is,

$$s_{it} = \begin{cases} 1 & \text{if } t \text{ is associated with season } i, \\ -\dfrac{1}{q-1} & \text{otherwise,} \end{cases}$$

for $i = 1, \dots, q$. In that case, using the same line of reasoning as in Section 6.4, the corresponding VECM for $y_t$ is

$$\Delta y_t = \Pi^{o} y^{o}_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + \sum_{i=1}^{q-1} \delta_i^{*} s_{it} + u_t,$$

where the $\delta_i^{*}$'s are $(K \times 1)$ parameter vectors. Notice that $Ls_{it} = s_{i,t-1} = s_{i-1,t}$ for $i = 2, \dots, q$ and $Ls_{1t} = s_{qt}$ and, for any $t$, $\sum_{i=1}^{q} s_{it} = 0$ so that $s_{qt} = -\sum_{i=1}^{q-1} s_{it}$. Hence, the latter sum can be substituted for $s_{qt}$ (see also Problem 8.1). In this model, the seasonal dummies have no impact on the asymptotic distribution of the LR statistic for the cointegrating rank (Johansen (1991)).
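A small Python sketch may help to see what seasonal dummies of this kind look like. It assumes the coding in which a dummy equals 1 in its own season and $-1/(q-1)$ otherwise, so that the $q$ dummies sum to zero in every period; the function name and the choice of quarterly data starting in the first season are hypothetical.

```python
import numpy as np

def centered_seasonal_dummies(n_obs, q, first_season=1):
    """Return an (n_obs x q) array whose column i - 1 holds the dummy for season i."""
    seasons = (np.arange(n_obs) + first_season - 1) % q   # season index 0, ..., q-1
    s = np.full((n_obs, q), -1.0 / (q - 1))
    s[np.arange(n_obs), seasons] = 1.0
    return s

s = centered_seasonal_dummies(n_obs=12, q=4)   # three full years of quarterly data
print(s[:4])
```

Only the first $q - 1$ columns are needed as regressors because the last one is minus the sum of the others.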


Remark 6 A different situation arises if the deterministic term includes a shift dummy variable $I_{(t > T_B)}$ which is zero up to time $T_B$ and then jumps to one. Such a variable affects the asymptotic distributions of the LR test statistics for the cointegrating rank. In fact, Johansen, Mosconi & Nielsen (2000) showed that in this case the asymptotic distributions depend on where the shift occurs in the sample. More precisely, they depend on the fraction of the sample before the break. In contrast, impulse dummy variables, which are always zero except in one specific period, do not affect the asymptotic properties of the LR tests. ■


8.2.6  An  Example 

We  have  applied  LR  trace  tests  for  the  cointegrating  rank  to  the  German 
interest  rate/inflation  example  data  from  Section  7.2.6  and  give  results  for 
different  lag  orders  in  Table  8.2.  Notice  that,  although  we  report  the  results  for 
the trace tests, the maximum eigenvalue variant is equivalent if $H_0: \mathrm{rk}(\Pi) = 1$
is  tested  in  a  bivariate  system.  In  that  case,  the  alternative  hypotheses  in 
(8.2.4)  and  (8.2.5)  coincide.  Because  the  inflation  rate  has  a  strong  seasonal 
pattern,  we  have  included  seasonal  dummy  variables  in  the  deterministic  term. 
Given  the  theoretical  considerations  in  Section  7.2.6,  one  may  not  see  the 
need  for  a  general  trend  in  the  model.  Clearly,  one  would  not  expect  the 
cointegration  relation  to  include  a  linear  trend.  In  fact,  one  may  wonder  about 
the  need  to  consider  a  deterministic  linear  trend  at  all  in  the  model  because 
one  could  argue  that  neither  interest  rates  nor  inflation  rates  are  likely  to  have 
such  components  in  Germany.  Even  if  there  is  a  strong  case  for  excluding  the 
possibility of a linear trend term in a long-run analysis of these two variables,
it  may  still  be  useful  to  include  such  a  trend  for  a  particular  sample  period. 
Recall  that  any  model  is  just  an  approximation  to  the  data  generation  process 
for  a  specific  period  of  time.  In  Table  8.2,  we  therefore  report  results  for 
different  deterministic  terms. 


Table 8.2. LR trace tests for the cointegration rank of the German interest rate/inflation system

deterministic term           no. of lagged   null          test     critical values
                             differences     hypothesis    value    10%       5%
constant, seasonal dummies   0               rk(Π) = 0     89.72    17.79     19.99
                                             rk(Π) = 1      1.54     7.50      9.13
                             3               rk(Π) = 0     21.78    17.79     19.99
                                             rk(Π) = 1      4.77     7.50      9.13
orthogonal linear trend,     0               rk(Π) = 0     89.10    13.31     15.34
seasonal dummies             3               rk(Π) = 0     20.80    13.31     15.34
linear trend,                0               rk(Π) = 0     97.21    22.95     25.47
seasonal dummies                             rk(Π) = 1      4.45    10.56     12.39
                             3               rk(Π) = 0     24.78    22.95     25.47
                                             rk(Π) = 1      7.72    10.56     12.39

Notes: Sample period: 1972.2 - 1998.4 (including presample values). Critical values from Johansen (1995, Tables 15.2, 15.3, and 15.4).


For all deterministic terms and all lag orders, the tests reject a cointegrating rank of zero. The only possible exception is the case where a fully general linear trend and three lagged differences are included in the model. In that case, the cointegration rank zero can only be rejected at the 10% level and not at the 5% level, whereas in all other cases the tests reject at a 5% level. Of course, the model with three lagged differences and a linear deterministic trend is the least restricted model considered in Table 8.2. Thus, if any one of the other models describes the DGP well, the same is true for the latter model. Therefore, one may argue that the tests based on this model should be the most reliable. Unfortunately, such an argument is valid for the size of the test at best. In small sample studies, some evidence was found that redundant lags or deterministic terms can have a negative effect on the power of the LR tests (see Hubrich, Lütkepohl & Saikkonen (2001) for an overview of small sample studies). Thus, taking the small sample properties of the tests into account, there is substantial evidence that the cointegrating rank is larger than zero.

For the models with a constant and a linear trend, none of the tests can reject a cointegration rank of $r = 1$. If a deterministic linear trend is assumed to be present in at least one of the variables and not in the cointegration relations, that is, the trend is orthogonal to the cointegration relations, then testing the null hypothesis $\mathrm{rk}(\Pi) = 1$ does not make sense for a bivariate system, as explained in Section 8.2.4. Therefore, no results are reported for that null hypothesis in Table 8.2. Thus, the evidence in favor of a single cointegration relation in our example system is overall quite strong. Therefore, we have used this rank in previous models for the two series.

The  discussion  of  which  deterministic  terms  to  include  in  the  model  for  our 
example  data  shows  that  there  is  a  need  for  statistical  procedures  to  assist 
in  the  decision.  There  are  indeed  appropriate  tests  available,  as  discussed 
in  Section  7.2.4.  We  will  return  to  some  such  tests  for  specific  hypotheses  of 
interest  in  the  present  context  in  Section  8.2.8.  Before  we  do  so,  we  will  discuss 
some  other  ideas  for  testing  the  cointegrating  rank  of  a  VECM.  In  the  next 
subsection,  we  consider  the  possibility  of  subtracting  the  deterministic  part 
first  and  then  applying  LR  type  tests  to  the  adjusted  series. 

8.2.7  Prior  Adjustment  for  Deterministic  Terms 

LR tests for the cointegrating rank were found to have low power, in particular in large models (large dimension and/or long lag order). Therefore, other tests and test variants have been proposed which have advantages at least in some situations. One variant was, for instance, proposed by Saikkonen & Lütkepohl (2000d). They suggested a two-step procedure in which the deterministic part is estimated first. Then the observed series are adjusted for the deterministic terms and an 'LR test' is applied to the adjusted system. We will discuss their approach for the case of a model with a linear trend term. The other cases of interest can be handled with straightforward modifications.

Thus, we consider a data generation process of the form

$$y_t = \mu_0 + \mu_1 t + x_t, \tag{8.2.18}$$

where $\mu_0$ and $\mu_1$ are fully general $(K \times 1)$ vectors and $x_t$ has a VECM representation of the form (8.2.1). Hence, the data generation process of $y_t$ has the VECM representation (8.2.14). Suppose we want to test the pair of hypotheses

$$H_0: \mathrm{rk}(\Pi) = r_0 \quad \text{versus} \quad H_1: \mathrm{rk}(\Pi) > r_0.$$

Then the model (8.2.14) is estimated by ML with a cointegration rank $r_0$ and estimators $\tilde{\alpha}$, $\tilde{\beta}$, $\tilde{\Gamma}_j$ ($j = 1, \dots, p-1$) as well as estimators of the other parameters are obtained. From these estimators we can get estimators of the levels VAR parameter matrices as follows (see Section 6.3, Eq. (6.3.7)):

$$\begin{aligned}
\tilde{A}_1 &= I_K + \tilde{\alpha}\tilde{\beta}' + \tilde{\Gamma}_1, \\
\tilde{A}_i &= \tilde{\Gamma}_i - \tilde{\Gamma}_{i-1}, \quad i = 2, \dots, p-1, \\
\tilde{A}_p &= -\tilde{\Gamma}_{p-1}.
\end{aligned}$$
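The mapping from VECM parameters to levels VAR matrices is mechanical and can be sketched as below. The function name `vecm_to_var` and the parameter values are hypothetical; the branch for a model without lagged differences ($p = 1$) is an addition of this sketch and simply returns $I_K + \alpha\beta'$.

```python
import numpy as np

def vecm_to_var(alpha, beta, gammas):
    """Map alpha, beta and [Gamma_1, ..., Gamma_{p-1}] to [A_1, ..., A_p]."""
    K = alpha.shape[0]
    Pi = alpha @ beta.T
    p = len(gammas) + 1
    if p == 1:                       # no lagged differences (added for completeness)
        return [np.eye(K) + Pi]
    A = [np.eye(K) + Pi + gammas[0]]                              # A_1
    A += [gammas[i] - gammas[i - 1] for i in range(1, p - 1)]     # A_2, ..., A_{p-1}
    A.append(-gammas[-1])                                         # A_p
    return A

# Hypothetical parameter values for K = 2, r0 = 1, p = 3
alpha = np.array([[-0.1], [0.2]])
beta = np.array([[1.0], [-1.0]])
gammas = [np.array([[0.3, 0.0], [0.1, 0.2]]),
          np.array([[0.1, 0.0], [0.0, 0.1]])]
A = vecm_to_var(alpha, beta, gammas)
print(sum(A) - np.eye(2))            # recovers alpha @ beta.T
```

A quick consistency check is that $A_1 + \cdots + A_p - I_K$ reproduces $\alpha\beta'$, because the $\Gamma_j$ terms cancel in the sum.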
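These estimators are used to estimate the parameters $\mu_0$ and $\mu_1$ in (8.2.18) by an EGLS procedure. To present the estimator, we define $\tilde{A}(L) := I_K - \tilde{A}_1 L - \cdots - \tilde{A}_p L^p$, $\tilde{G}_t := \tilde{A}(L)a_t$, and $\tilde{H}_t := \tilde{A}(L)b_t$, with

$$a_t := \begin{cases} 1 & \text{for } t \geq 1, \\ 0 & \text{for } t \leq 0, \end{cases} \qquad b_t := \begin{cases} t & \text{for } t \geq 1, \\ 0 & \text{for } t \leq 0. \end{cases}$$

Moreover, we define

$$\tilde{Q} := \begin{bmatrix} (\tilde{\alpha}'\tilde{\Sigma}_u^{-1}\tilde{\alpha})^{-1/2}\,\tilde{\alpha}'\tilde{\Sigma}_u^{-1} \\ (\tilde{\alpha}_\perp'\tilde{\Sigma}_u\tilde{\alpha}_\perp)^{-1/2}\,\tilde{\alpha}_\perp' \end{bmatrix}.$$

Premultiplying (8.2.18) by $\tilde{Q}\tilde{A}(L)$ gives

$$\tilde{Q}\tilde{A}(L)y_t = \tilde{Q}\tilde{G}_t\mu_0 + \tilde{Q}\tilde{H}_t\mu_1 + \eta_t, \quad t = p+1, \dots, T, \tag{8.2.19}$$

where the transformation ensures that the error term has roughly a unit covariance matrix because $\tilde{Q}'\tilde{Q} = \tilde{\Sigma}_u^{-1}$. Thus, estimating the transformed model (8.2.19) by LS amounts to EGLS estimation of $\mu_0$ and $\mu_1$ in the untransformed model $y_t = \mu_0 + \mu_1 t + x_t$. The resulting estimators of $\mu_0$ and $\mu_1$ will be denoted by $\hat{\mu}_0^{GLS}$ and $\hat{\mu}_1^{GLS}$, respectively.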
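The property that the transformation matrix reproduces the inverse error covariance matrix can be checked numerically. The sketch below rests on an assumption: it builds a matrix $Q$ by stacking $(\alpha'\Sigma_u^{-1}\alpha)^{-1/2}\alpha'\Sigma_u^{-1}$ on $(\alpha_\perp'\Sigma_u\alpha_\perp)^{-1/2}\alpha_\perp'$, where $\alpha_\perp$ spans the orthogonal complement of $\alpha$, and verifies $Q'Q = \Sigma_u^{-1}$ for hypothetical parameter values. The function names are illustrative.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def gls_transformation(alpha, sigma_u):
    """Stack the two blocks described above into a (K x K) matrix Q."""
    K, r = alpha.shape
    # orthogonal complement of alpha from the full SVD: K x (K - r)
    alpha_perp = np.linalg.svd(alpha, full_matrices=True)[0][:, r:]
    sigma_inv = np.linalg.inv(sigma_u)
    top = inv_sqrt(alpha.T @ sigma_inv @ alpha) @ alpha.T @ sigma_inv
    bottom = inv_sqrt(alpha_perp.T @ sigma_u @ alpha_perp) @ alpha_perp.T
    return np.vstack([top, bottom])

# Hypothetical values for K = 3 and r0 = 1
alpha = np.array([[-0.2], [0.1], [0.05]])
sigma_u = np.array([[1.0, 0.3, 0.1],
                    [0.3, 2.0, 0.2],
                    [0.1, 0.2, 0.5]])
Q = gls_transformation(alpha, sigma_u)
print(np.allclose(Q.T @ Q, np.linalg.inv(sigma_u)))
```

Any other matrix with $Q'Q = \Sigma_u^{-1}$, for instance one based on a Cholesky factor, would give the same LS estimator in the transformed model, since the estimator depends on $Q$ only through $Q'Q$.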

Using these estimators, $y_t$ can now be trend-adjusted as $\hat{x}_t := y_t - \hat{\mu}_0^{GLS} - \hat{\mu}_1^{GLS} t$ and an 'LR test' can be applied to $\hat{x}_t$, as described in Section 8.2.1. Of course, although the test statistics are computed in the same way as described in that section except that $y_t$ is replaced by $\hat{x}_t$, the tests are now not really LR tests anymore because they are applied to adjusted data rather than the original ones. To distinguish the resulting tests from the actual LR tests, we will refer to them as GLS-LR tests and we denote the trace and maximum eigenvalue test statistics as $\lambda_{LR}^{GLS}(r_0, K)$ and $\lambda_{LR}^{GLS}(r_0, r_0 + 1)$, respectively, in the following. Given that these tests are not actual LR tests, it may also not be surprising that the limiting distributions of the test statistics are different from those of the actual LR statistics. They also depend on the deterministic terms that are included in the model. To state the asymptotic distributions formally, we use the following conventions. A Brownian bridge of dimension $K - r_0$ is defined as

$$W_B(s) = W_{K-r_0}(s) - sW_{K-r_0}(1)$$

and an integral of a stochastic process $F$ with respect to a Brownian bridge is defined as

$$\int_0^1 F\,dW_B' := \int_0^1 F\,dW_{K-r_0}' - \int_0^1 F\,ds\,W_{K-r_0}(1)'.$$

Now we can state the limiting null distributions of the $\lambda_{LR}^{GLS}$ statistics for the different deterministic terms of interest.
Proposition 8.3 (Limiting Distributions of GLS-LR Tests for the Cointegrating Rank)

Under the conditions of Proposition 8.2, the GLS-LR test statistics have the following limiting null distributions:

$$\lambda_{LR}^{GLS}(r_0, K) \xrightarrow{d} \mathrm{tr}(\mathcal{D})$$

and

$$\lambda_{LR}^{GLS}(r_0, r_0 + 1) \xrightarrow{d} \lambda_{\max}(\mathcal{D}),$$

where $\mathcal{D}$ depends on the deterministic terms included in the model as follows:

(1) If $\mu_t = \mu_0$ is a constant,

$$\mathcal{D} = \left(\int_0^1 W\,dW'\right)' \left(\int_0^1 WW'\,ds\right)^{-1} \left(\int_0^1 W\,dW'\right)$$

with $W := W_{K-r_0}(s)$.

(2) If $\mu_t = \mu_0 + \mu_1 t$ is a linear trend,

$$\mathcal{D} = \left(\int_0^1 W_B\,dW_B'\right)' \left(\int_0^1 W_B W_B'\,ds\right)^{-1} \left(\int_0^1 W_B\,dW_B'\right).$$

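(3) If $\mu_t = \mu_0 + \mu_1 t$ is a linear trend with $\mu_1 \neq 0$ and $\beta'\mu_1 = 0$,

$$\mathcal{D} = \left(\int_0^1 \bar{W}\,dW'\right)' \left(\int_0^1 W^{c}W^{c\prime}\,ds\right)^{-1} \left(\int_0^1 \bar{W}\,dW'\right)$$

with $W := W_{K-r_0}(s)$, $W^{c}(s) := [W_{K-r_0-1}(s)', s]'$, and $\bar{W}(s)$ as in (8.2.17).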

Proofs of these results can be found in Saikkonen & Lütkepohl (2000b, d) and Lütkepohl et al. (2001). The following remarks may be of interest.

Remark 1 The adjustment for deterministic terms may appear to be complicated at first sight. One may, for instance, wonder why the deterministic terms are not directly estimated by LS and then subtracted from the observed $y_t$. Unfortunately, in the present case, the LS estimators do not have the same asymptotic properties as the EGLS estimators described here and also the resulting cointegration tests will have different properties. The present procedure is useful because it results in tests with attractive asymptotic properties, as we will argue in the next remark. ■
Remark 2 Comparing the asymptotic distributions in Propositions 8.2 and
8.3, it turns out that the statistics for the case of a constant deterministic
term ($\mu_t = \mu_0$) have the same asymptotic distributions as the corresponding
LR statistics for the case without any deterministic term. Thus, estimation
of the constant mean term does not affect the asymptotic distributions of the
$\lambda_{LR}^{GLS}$ statistics, while it has an impact on the LR statistics in Proposition
8.2. This observation suggests that the GLS-LR tests may have better power
properties, at least asymptotically. This conjecture was actually confirmed in
a local power comparison by Saikkonen & Lütkepohl (1999). The picture
is less clear for the other deterministic terms. In other words, if there is a linear
trend term in the model, a local power comparison does not lead to a unique
ranking of the tests. In some situations the LR tests are preferable and in
other situations the GLS-LR variants may be preferable, depending on the
properties of the data generation process. Also, local power is an asymptotic
concept which allows one to investigate the power properties of tests in regions
close to the null hypothesis when the sample size goes to infinity. Because
asymptotic  theory  is  not  always  a  good  guide  for  small  sample  properties, 
these  results  do  not  guarantee  superior  performance  of  the  GLS-LR  tests, 
even  when  only  a  constant  mean  term  is  included  in  the  model.  In  particular, 
the  latter  tests  may  have  size  distortions  in  small  samples.  ■ 

Remark  3  Although  the  asymptotic  distributions  in  Proposition  8.3  look  a 
little  more  complicated  than  those  in  Proposition  8.2,  critical  values  can  again 
be  simulated  easily  because  the  asymptotic  distributions  are  still  functionals 
of  Wiener  processes.  Percentage  points  for  all  three  asymptotic  distributions 
are tabulated in the literature (see Johansen (1995), Lütkepohl & Saikkonen
(2000) and Saikkonen & Lütkepohl (2000b)). ■
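To illustrate how such critical values may be simulated, the following sketch approximates the distribution of $\mathrm{tr}(\mathcal{D})$ for the constant case (1) of Proposition 8.3 by replacing the Wiener process with a random walk. It is only an illustration under stated assumptions: the function name, the number of steps, the number of replications, and the seed are arbitrary choices made here, and the published tables rely on much larger simulation designs.

```python
import numpy as np

def simulate_trace_stat(dim, n_steps=400, n_reps=2000, seed=0):
    """Draws from the discretized version of tr(D), case (1) of Proposition 8.3.

    dim plays the role of K - r_0. Sums over a random walk replace the
    stochastic integrals (an approximation assumed in this sketch).
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_reps, n_steps, dim))
    w = np.cumsum(eps, axis=1)
    # Lagged random walk W_{t-1}, starting at zero
    w_lag = np.concatenate([np.zeros((n_reps, 1, dim)), w[:, :-1, :]], axis=1)
    # a ~ T * int W dW',  b ~ T^2 * int W W' ds; the powers of T cancel in D
    a = np.einsum("rti,rtj->rij", w_lag, eps)
    b = np.einsum("rti,rtj->rij", w_lag, w_lag)
    d = np.einsum("rji,rjk->rik", a, np.linalg.solve(b, a))   # a' b^{-1} a
    return np.trace(d, axis1=1, axis2=2)

stats = simulate_trace_stat(dim=2)
q90, q95 = np.quantile(stats, [0.90, 0.95])
```

The simulated quantiles `q90` and `q95` can be compared with the tabulated 10% and 5% critical values for $K - r_0 = 2$; agreement is only approximate because of discretization and simulation error.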

Remark 4 The GLS-LR tests can also be adapted for other deterministic
terms such as higher order polynomials and seasonal dummy variables. For
the former case, different asymptotic distributions will result, whereas seasonal
dummies can be added to all three deterministic terms considered in
Proposition 8.3 without affecting the limiting distributions of the test statistics.
An advantage of the GLS-LR tests is that these asymptotic distributions
are also not affected by including shift dummies in the deterministic term.
This property is in contrast to the LR tests and means that the same critical
values can be used as for the corresponding tests without shift dummies (see
Saikkonen & Lütkepohl (2000c)). In particular, there is no need to compute
new critical values for each break point. Given the computing power which
is available today, this may not seem a great advantage over the LR tests
at first sight. It makes it possible, however, to also consider cases where the
actual break date is unknown and has to be estimated in addition to the other
parameters of the process. Lütkepohl, Saikkonen & Trenkler (2004) consider
that case and show that a number of different estimators of the break date can
be used without affecting the asymptotic distributions of the $\lambda_{LR}^{GLS}$ statistics
under the null hypothesis. ■
Example

We have also applied the GLS-LR tests to the German interest rate/inflation
example series and present the results in Table 8.3. Although the evidence is
again clearly in favor of a cointegrating rank of $r = 1$, all tests have more
trouble rejecting $r_0 = 0$ if the larger lag order is used. In that case, the
hypothesis $\mathrm{rk}(\Pi) = 0$ cannot even be rejected at the 10% level if only a
constant and seasonal dummies are included in the model. Thus, although
the  GLS-LR  tests  have  good  local  power  properties  especially  for  this  case, 
superior  small  sample  power  is  not  guaranteed.  Of  course,  it  must  also  be  kept 
in  mind  that  a  test  with  higher  power  does  not  necessarily  reject  a  specific 
null  hypothesis  for  a  particular  data  set  more  easily  than  a  test  with  lower 
power.  Moreover,  our  theoretical  models  underlying  the  asymptotic  analysis 
may  not  fully  capture  all  features  of  the  actual  data  generation  process. 


Table 8.3. GLS-LR trace tests for the cointegration rank of the German interest
rate/inflation system

deterministic               no. of lagged   null          test     critical values
term                        differences     hypothesis    value    10%      5%

constant,                   0               rk(Π) = 0     28.21    10.35    12.21
seasonal dummies                            rk(Π) = 1      0.41     2.98     4.14
                            3               rk(Π) = 0     10.13    10.35    12.21
                                            rk(Π) = 1      2.42     2.98     4.14

orthogonal linear trend,    0               rk(Π) = 0     28.16     8.03     9.79
seasonal dummies            3               rk(Π) = 0      9.75     8.03     9.79

linear trend,               0               rk(Π) = 0     49.42    13.89    15.92
seasonal dummies                            rk(Π) = 1      1.83     5.43     6.83
                            3               rk(Π) = 0     14.43    13.89    15.92
                                            rk(Π) = 1      4.71     5.43     6.83

Notes: Sample period: 1972.2-1998.4 (including presample values). Critical values
from Johansen (1995, Table 15.1), Saikkonen & Lütkepohl (2000b, Table 1) and
Lütkepohl & Saikkonen (2000, Table 1) for the case of a constant, an orthogonal
trend, and a general linear trend, respectively.


8.2.8  Choice  of  Deterministic  Terms 

As mentioned earlier, including redundant deterministic terms in the models
on which cointegration rank tests are based may result in a substantial loss of
power (see also Doornik, Hendry & Nielsen (1998) and Hubrich et al. (2001)).
Therefore, it is helpful that statistical procedures are available for investigating
which terms to include. Johansen (1994, 1995) proposed LR tests for
hypotheses regarding the deterministic terms. These tests are obvious choices
because the ML estimators and, hence, the corresponding maxima of the
likelihood functions are easy to compute for various different deterministic terms
(see Section 7.2.4).

Apart from dummy variables, a linear trend

$$\mu_0 + \mu_1 t \qquad (8.2.20)$$

is the most general deterministic term considered in the foregoing. A possible
pair of hypotheses of interest related to this term when $\mu_1 \neq 0$ is

$$H_0: \beta'\mu_1 = 0 \quad \text{versus} \quad H_1: \beta'\mu_1 \neq 0. \qquad (8.2.21)$$

Hence, there is a deterministic linear trend in the variables and the test checks
whether the trend is orthogonal to the cointegration relations. In other words,
the test checks the model (8.2.14) against (8.2.16). The corresponding LR test
has a standard $\chi^2$ limiting distribution under the null hypothesis, as we have
seen in Section 7.2.4. If the underlying VECM has cointegrating rank $r$ and,
thus, $\beta$ is a $(K \times r)$ matrix, $r$ zero restrictions are specified in $H_0$. Therefore
we have $r$ degrees of freedom, that is, the LR test statistic has an asymptotic
$\chi^2(r)$-distribution.

Another pair of hypotheses of interest is

$$H_0: \mu_1 = 0 \quad \text{versus} \quad H_1: \mu_1 \neq 0,\ \beta'\mu_1 = 0. \qquad (8.2.22)$$

In this case, a model with an unrestricted intercept, (8.2.16), is tested against
one where no linear trend is present and, thus, the constant can be absorbed
into the cointegration relations as in (8.2.11). Again, the LR test has standard
asymptotic properties, that is, for a VECM of dimension $K$ and with
cointegration rank $r$, it has a $\chi^2(K - r)$ limiting distribution.

If  these  tests  are  used  for  deciding  on  the  deterministic  term  in  a  VECM,  it 
may  be  worth  keeping  in  mind  that  they  introduce  additional  uncertainty  into 
the  modelling  procedure.  The  tests  are  performed  for  a  model  with  a  specific 
cointegrating  rank.  Thus,  ideally  the  cointegrating  rank  has  to  be  determined 
before  the  deterministic  terms  are  tested,  whereas  one  motivation  for  them 
was  that  cointegrating  rank  tests  may  have  better  power  if  the  deterministic 
term  is  specified  properly.  Thus,  the  tests  present  only  a  partial  solution  to 
the  problem.  Proceeding  as  in  the  example  and  checking  the  robustness  of  the 
rank  tests  with  respect  to  different  specifications  of  the  deterministic  terms  is 
a  useful  strategy. 

8.2.9  Other  Approaches  to  Testing  for  the  Cointegrating  Rank 

The  literature  on  cointegration  rank  tests  has  grown  rapidly  in  recent  years. 
Many related issues have been discussed and investigated. Examples are
nonnormal processes (Lucas (1997, 1998), Boswijk & Lucas (2002), Caner (1998)),
the presence of higher order integration and long memory (Gonzalo & Lee
(1998), Breitung & Hassler (2002)), the impact of the dimension of the data
generation process (Ho & Sørensen (1996)) and using a reversed sequence of
null hypotheses in testing for the cointegrating rank (Snell (1999)). Also, a
number of studies considered the small sample properties of the tests. A recent
review of the related literature with many more references was provided
by Hubrich et al. (2001).

Moreover, a number of other test procedures were proposed. For instance,
Lütkepohl & Saikkonen (1999a) used the idea underlying the causality test
which was presented in Section 7.6.3 and augmented the number of lags to
obtain a $\chi^2$-test for the cointegrating rank. Bewley & Yang (1995) and Yang &
Bewley (1996) constructed a test based on canonical correlations of the levels
variables. Stock & Watson (1988) considered the use of principal component
analysis and Bierens (1997) presented a fully nonparametric approach to
cointegration rank testing. These and many other proposals were also reviewed in
Hubrich et al. (2001), including the possibility of choosing the cointegrating
rank by model selection criteria. A range of cointegration tests was also
proposed and investigated in a single equation framework (e.g., Engle & Granger
(1987), Phillips & Ouliaris (1990), Banerjee et al. (1993), Choi (1994), Shin
(1994), Haug (1996)). They are of limited usefulness for the situation we have
considered here, where several cointegrating relations may be present in a
system of variables. Therefore, no details are presented.


8.3  Subset  VECMs 

When the lag order and the cointegration rank of a VECM have been determined,
specifying further restrictions may be useful to reduce the dimensionality
of the parameter space and thereby improve the estimation precision.
As we have seen in Sections 7.2 and 7.3, the standard t-ratios and F-tests retain
their usual asymptotic properties if they are applied to the short-run and
loading parameters of a VECM. Therefore, subset modelling for cointegrated
systems may be based on statistical tests. Instead of using testing procedures,
restrictions for individual parameters or groups of parameters may also be
based on model selection criteria in a similar way as in Chapter 5. In particular,
the strategies applied to individual equations of the system may be used.
Consider, for instance, the j-th equation of a VECM,

$$y_{jt} = x_{1t}\theta_1 + \cdots + x_{Nt}\theta_N + u_{jt}, \quad t = 1, \dots, T. \qquad (8.3.1)$$

Here all right-hand side regressor variables are denoted by $x_{kt}$, including
deterministic terms and the cointegration relations. Thus, $x_{kt} = \beta_i' y_{t-1}$, where
$\beta_i$ is the i-th column of the cointegration matrix $\beta$, is a possible regressor. If $\beta$
is unknown, it may be replaced by a superconsistent estimator $\hat\beta$, which may
be based on the unrestricted model, and variables $x_{kt} = \hat\beta_i' y_{t-1}$ may be added
as regressors in (8.3.1). Using this setup, all the standard procedures described
in Section 5.2.8 are available, including the full search procedure, sequential
elimination of regressors as well as top-down and bottom-up strategies.
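A sequential elimination of regressors for a single equation of the form (8.3.1) might be sketched as follows. This is a simplified illustration rather than the exact procedure of Section 5.2.8: the criterion is assumed to be AIC = ln(SSR/T) + 2n/T with n the number of retained regressors, and the function names and the simulated data are hypothetical.

```python
import numpy as np

def aic(y, X):
    """AIC = ln(SSR/T) + 2 * (number of regressors) / T for an LS fit."""
    T = len(y)
    if X.shape[1] == 0:
        resid = y
    else:
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
    return np.log(resid @ resid / T) + 2.0 * X.shape[1] / T

def sequential_elimination(y, X):
    """Drop, one at a time, the regressor whose removal lowers the AIC most."""
    keep = list(range(X.shape[1]))
    best = aic(y, X[:, keep])
    while keep:
        trials = [(aic(y, X[:, [k for k in keep if k != j]]), j) for j in keep]
        candidate, drop = min(trials)
        if candidate >= best:      # no deletion improves the criterion
            break
        best = candidate
        keep.remove(drop)
    return keep

# Hypothetical data: only the first of four regressors matters
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(200)
kept = sequential_elimination(y, X)
```

In a VECM application the columns of `X` would hold the lagged differences, deterministic terms, and (estimated) cointegration relations of one equation, and the search would be repeated for each equation separately.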

For the German interest rate/inflation example with cointegration relation
$\beta' y_t = R_t - 4Dp_t$, we have used the sequential elimination of regressors
procedure in conjunction with the AIC criterion based on a search for restrictions
on individual equations and found the following model, using the sample
period 1973.2-1998.4 plus the required presample values:

(8.3.2)


Here t-ratios are given in parentheses underneath the parameter estimates.
This is precisely the model that was also used in Section 7.3.3, see (7.3.9), to
illustrate EGLS estimation and that procedure is used here as well. Notice,
however, that the search procedure was based on LS estimation of individual
equations. Hence, different t-ratios were the basis for variable selection. Still,
generally the coefficients with large absolute t-ratios in the unrestricted model
(see Table 7.1) are maintained in the restricted subset VECM.

In  the  present  example,  we  have  pretended  that  the  cointegration  relation 
is  known.  Such  an  assumption  is  not  required  for  the  subset  procedures  to  be 
applicable.  The  same  subset  model  selection  procedure  may  be  applied  if  the 
cointegration  relations  contain  estimated  parameters.  In  other  words,  it  may 
be used as the second stage in a two-stage procedure, where the cointegration
matrix $\beta$ is estimated first and then the estimated $\beta$ matrix is substituted for
the true one in the second stage. The subset restrictions are determined in
the second stage.


8.4 Model Diagnostics

Diagnostic checking is also an important stage of the general modelling
procedure for VECMs. Many of the tests for model adequacy discussed for
stationary VAR processes can be extended to the VECM case. Tests for residual
autocorrelation, nonnormality, and structural change will be treated in turn
in the following. We will start with a discussion of the properties of residual
autocorrelations of an estimated VECM. The underlying model is assumed to
be of the simple form

$$\Delta y_t = \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + u_t, \qquad (8.4.1)$$

where $\alpha$ and $\beta$ are $(K \times r)$ matrices of rank $r$ and all other symbols are
defined as in (8.2.1). We assume that the model has been estimated by reduced
rank ML or the two-stage procedure discussed in Section 7.2.5. If not explicitly
stated otherwise, no restrictions are placed on the loading and short-run
parameters.

8.4.1  Checking  for  Residual  Autocorrelation 

Asymptotic  Properties  of  Residual  Autocovariances  and 
Autocorrelations 

To study the properties of the autocovariances and autocorrelations of the
residuals of a VECM, we denote the estimated residuals by $\hat u_t$ and otherwise
use the notation from Section 4.4 of Chapter 4 and Section 5.2.9 of Chapter
5, that is,

$$\hat C_i := \frac{1}{T} \sum_{t=i+1}^{T} \hat u_t \hat u_{t-i}', \quad i = 0, 1, \dots, h,$$

$$\hat{\mathbf{C}}_h := (\hat C_1, \dots, \hat C_h), \qquad \hat{\mathbf{c}}_h := \mathrm{vec}(\hat{\mathbf{C}}_h),$$

are the residual autocovariances and $\hat R_i$ ($i = 0, 1, \dots, h$),
$\hat{\mathbf{R}}_h := (\hat R_1, \dots, \hat R_h)$, and $\hat{\mathbf{r}}_h := \mathrm{vec}(\hat{\mathbf{R}}_h)$
denote the corresponding residual autocorrelations.

To derive the asymptotic properties of these quantities, it is convenient
to also treat the case of a known cointegration matrix. Suppose the short-run
and loading parameters of the VECM (8.4.1) are estimated with the
same method as before, except that the true cointegration matrix is used
instead of the estimated one. For the resulting estimation residuals we denote
the previously defined quantities by tildes instead of hats. In other words,
we have $\tilde C_i$, $\tilde{\mathbf{C}}_h$, and $\tilde{\mathbf{c}}_h$ instead of $\hat C_i$, $\hat{\mathbf{C}}_h$, and $\hat{\mathbf{c}}_h$, respectively, and so on.
Brüggemann, Lütkepohl & Saikkonen (2004) showed that $\hat C_i$ and $\tilde C_i$ have
the same asymptotic distributions. More precisely they proved the following
lemma.

Lemma 8.1

$$\hat C_i - \tilde C_i = O_p(T^{-1}) \quad \text{for } i = 1, 2, \dots.$$


Although Brüggemann et al. (2004) showed this result for full VECMs
estimated by reduced rank ML or unrestricted LS, it is clear from their proof
that it also applies for other asymptotically equivalent estimation methods.
The lemma enables us to get the asymptotic distributions of residual
autocovariances, for example, with the same arguments as previously derived results
(see, e.g., Proposition 5.7) because, if the cointegration matrix is known, all
regressors in the VECM are stationary variables. Therefore, the same arguments
apply as in Section 5.2.9 in Chapter 5. From Lemma 8.1 it then follows
that

$$\sqrt{T}\,\hat C_i - \sqrt{T}\,\tilde C_i = o_p(1)$$

so that $\sqrt{T}\,\hat{\mathbf{c}}_h$ and $\sqrt{T}\,\tilde{\mathbf{c}}_h$ have identical asymptotic distributions. From the
asymptotic distributions of the residual autocovariances we also get those of
the residual autocorrelations in the familiar way.


Portmanteau  and  LM  Tests  for  Residual  Autocorrelation 

Brüggemann et al. (2004) also showed that portmanteau and LM tests for
residual autocorrelation can be used in conjunction with VECMs. In this
case, the portmanteau statistic

$$Q_h := T \sum_{i=1}^{h} \mathrm{tr}(\hat C_i' \hat C_0^{-1} \hat C_i \hat C_0^{-1}) = T\,\hat{\mathbf{c}}_h' (I_h \otimes \hat C_0^{-1} \otimes \hat C_0^{-1})\,\hat{\mathbf{c}}_h$$

has an approximate $\chi^2(hK^2 - K^2(p-1) - Kr)$-distribution. Notice that the
degrees of freedom are adjusted relative to the stationary full VAR case. Now we
subtract from the number of autocovariances included in the statistic ($hK^2$)
the number of estimated parameters not counting the elements of the
cointegration matrix. Again this result follows from Lemma 8.1 which allows us to
treat the cointegration matrix as known for asymptotic derivations, even if it
is estimated.
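The portmanteau statistic and its adjusted degrees of freedom may be computed along the following lines. This is a minimal sketch, assuming the residuals are stored as a (T x K) array; the function names are illustrative, the example residuals are simulated noise rather than VECM residuals, and the sketch also reports the variant with weights T^2/(T - i) that is often preferred in small samples.

```python
import numpy as np

def autocov(resid, i):
    """C_i = (1/T) * sum_{t=i+1}^{T} u_t u_{t-i}' for residuals in rows."""
    T = resid.shape[0]
    return resid[i:].T @ resid[:T - i] / T

def portmanteau(resid, h, p, r):
    """Return (Q_h, small-sample variant, degrees of freedom) for a full VECM
    with p - 1 lagged differences and cointegrating rank r."""
    T, K = resid.shape
    c0_inv = np.linalg.inv(autocov(resid, 0))
    q, q_mod = 0.0, 0.0
    for i in range(1, h + 1):
        ci = autocov(resid, i)
        term = np.trace(ci.T @ c0_inv @ ci @ c0_inv)
        q += T * term
        q_mod += T**2 * term / (T - i)
    # hK^2 autocovariance elements minus K^2(p-1) short-run and Kr loading parameters
    df = h * K**2 - K**2 * (p - 1) - K * r
    return q, q_mod, df

# Hypothetical residuals for a bivariate system
rng = np.random.default_rng(0)
resid = rng.standard_normal((200, 2))
q, q_mod, df = portmanteau(resid, h=24, p=4, r=1)
```

Note that the elements of the cointegration matrix do not enter the degrees of freedom, in line with the argument based on Lemma 8.1.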

It may be worth emphasizing that this result also holds if the VECM
is estimated by unrestricted LS or, equivalently, the corresponding VAR in
levels is estimated by unrestricted LS. In other words, if the integration and
cointegration properties of a system of time series are not clear and an analyst
therefore decides to use a levels VAR model, the portmanteau test cannot be
used because the degrees of freedom of the approximating $\chi^2$-distribution
depend on the cointegrating rank $r$ and, hence, are not known. If one ignores
this problem and simply uses the smaller degrees
of freedom for the stationary full VAR case ($hK^2 - pK^2$), the test is likely to
reject a true null hypothesis far too often. Also, recall that the approximate
$\chi^2$-distribution is obtained under the assumption that $h$ goes to infinity with the
sample size. Thus, the portmanteau test is not suitable for testing for residual
autocorrelation of low order. As in the stationary case, in small samples it
may be preferable to use the modified portmanteau statistic

$$\bar Q_h := T^2 \sum_{i=1}^{h} (T - i)^{-1}\,\mathrm{tr}(\hat C_i' \hat C_0^{-1} \hat C_i \hat C_0^{-1}).$$


The asymptotic distribution of the LM statistic for residual autocorrelation
is not affected by the presence of integrated variables. We may use the
auxiliary regression model

$$\hat u_t = \alpha\hat\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + D_1 \hat u_{t-1} + \cdots + D_h \hat u_{t-h} + \varepsilon_t, \quad t = 1, \dots, T, \qquad (8.4.2)$$

with $\hat u_s = 0$ for $s < 1$, and compute the LM statistic for the hypotheses

$$H_0: D_1 = \cdots = D_h = 0 \quad \text{vs.} \quad H_1: D_j \neq 0 \text{ for at least one } j \in \{1, \dots, h\}.$$

The resulting LM statistic has an asymptotic $\chi^2$-distribution,

$$\lambda_{LM}(h) \xrightarrow{d} \chi^2(hK^2),$$

if the null hypothesis of no autocorrelation is true, as in the stationary case (see
Section 4.4.4). In contrast to the portmanteau test, the LM test is especially
useful for testing for low order residual autocorrelation. For large $h$, it may in
fact not be possible to estimate the parameters in the auxiliary model (8.4.2)
because of an insufficient sample size.

Both  the  portmanteau  tests  and  the  LM  tests  are  also  applicable  for  subset 
VECMs  with  restrictions  on  the  short-run  and  loading  parameters.  In  that 
case,  modifications  analogous  to  those  described  in  Section  5.2.9  have  to  be 
used.  For  the  portmanteau  tests,  this  means  that  the  degrees  of  freedom  in  the 
approximate  distributions  have  to  be  adjusted.  More  precisely,  the  number  of 
estimated  loading  and  short-term  parameters  is  subtracted  from  the  number 
of  autocovariances  included  in  the  statistic.  Here  restricted  parameters  are 
not  counted.  For  the  LM  tests,  the  auxiliary  model  has  to  be  modified.  The 
estimated  residuals  may  now  come  from  a  two-stage  estimation  as  described 
in  Section  7.3.2.  Moreover,  the  restrictions  should  also  be  accounted  for  in  the 
auxiliary  model  as  described  in  Section  5.2.9. 

To illustrate these tests, we have applied them to the subset VECM (8.3.2)
for the German interest rate/inflation example data. In this case, the
cointegration relation is assumed to be known. According to our previous results,
the same asymptotic distributions of the autocorrelation test statistics are
obtained for an estimated cointegration relation. Moreover, deterministic terms
are included in the model (8.3.2). Again, it can be shown that such terms do
not affect the asymptotic distributions of the portmanteau and LM tests for
residual autocorrelation (see Brüggemann et al. (2004) for details).

Table 8.4. Residual autocorrelation tests for subset VECM (8.3.2)

test                       λ_LM(1)  λ_LM(2)  λ_LM(3)  λ_LM(4)  Q_24     Q̄_24     Q_30      Q̄_30
test statistic             3.91     6.62     6.89     10.26    77.2     89.3     93.5      111.5
approximate distribution   χ²(4)    χ²(8)    χ²(12)   χ²(16)   χ²(86)   χ²(86)   χ²(110)   χ²(110)
p-value                    0.42     0.58     0.86     0.85     0.74     0.38     0.87      0.44

Both  types  of  tests  have  been  applied  with  different  lag  orders  h  and 
the  results  are  given  in  Table  8.4.  The  LM  tests  are  useful  for  testing  for  low 
order  residual  autocorrelation.  Therefore,  only  lags  one  to  four  are  considered. 
Clearly,  for  a  very  long  lag  length  (high  order  autocorrelation)  the  degrees  of 
freedom  may  be  exhausted  in  the  auxiliary  regression.  In  contrast,  the  lag 
length $h$ has to be large for the approximate $\chi^2$-distribution to be valid for
the  portmanteau  tests.  Therefore,  only  large  lag  orders  are  considered  for 
these  tests.  All  asymptotic  p-values  in  Table  8.4  are  substantially  larger  than 
conventional  significance  levels  for  such  tests.  Hence,  there  is  no  apparent 
residual  autocorrelation  problem  for  our  example  model. 
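The asymptotic p-values of the LM tests in Table 8.4 can be checked with a few lines of code. The sketch below uses only the Python standard library; the helper `chi2_sf` is a hypothetical name introduced here and evaluates the upper tail probability of a $\chi^2$ variable with integer degrees of freedom through the recurrence Q(a+1, y) = Q(a, y) + y^a e^{-y} / Γ(a+1) for the regularized incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """P(chi2(df) > x) for x > 0 and a positive integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

# LM statistics from Table 8.4; K = 2 variables, so the degrees of freedom are h * K**2
K = 2
lm_stats = {1: 3.91, 2: 6.62, 3: 6.89, 4: 10.26}
p_values = {h: chi2_sf(stat, h * K**2) for h, stat in lm_stats.items()}
```

Rounded to two digits, the resulting values should agree with the p-values reported for the LM tests in the table.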

8.4.2  Testing  for  Nonnormality 

The  tests  for  nonnormality  considered  in  Chapter  4,  Section  4.5,  are  based 
on  the  estimated  residuals  from  a  VAR  process.  We  can  use  the  residuals  of 
a  VECM  instead  without  affecting  the  asymptotic  distributions  of  the  test 
statistics.  This  result  follows  again  from  the  previously  used  superconsistency 
of the estimator for the cointegration matrix and the properties of the empirical
moment matrices of integrated variables (see also Kilian & Demiroglu
(2000)). 

8.4.3  Tests  for  Structural  Change 

Time  invariance  is  an  important  property  of  a  VECM  for  valid  statistical 
inference  as  well  as  for  proper  economic  analysis  and  forecasting.  Therefore, 
tests  for  structural  change  are  also  important  tools  for  diagnostic  checking  of 
VECMs.  The  Chow  tests  and  the  prediction  tests  considered  in  Section  4.6  for 
stationary  VARs  can  be  extended  easily  to  the  case  of  cointegrated  systems. 
We  will  discuss  both  types  of  tests  in  the  following. 

Chow  Tests 

Analogously to Section 4.6.1, in deriving the Chow tests, we assume that
a change in the parameters of the VECM (8.4.1) is suspected after period
$T_1 < T$. For a sample $y_1, \dots, y_T$ plus the required presample values, we can
then set up the model in two parts:

$$\Delta Y_{(1)} = \alpha_{(1)}\beta_{(1)}' Y_{-1(1)} + \Gamma_{(1)} \Delta X_{(1)} + U_{(1)} \qquad (8.4.3)$$

and

$$\Delta Y_{(2)} = \alpha_{(2)}\beta_{(2)}' Y_{-1(2)} + \Gamma_{(2)} \Delta X_{(2)} + U_{(2)}, \qquad (8.4.4)$$

where $\Delta Y_{(1)} := [\Delta y_1, \dots, \Delta y_{T_1}]$, $\Delta Y_{(2)} := [\Delta y_{T_1+1}, \dots, \Delta y_T]$ and the other
data matrices are partitioned accordingly. The parameter matrices $\alpha_{(i)}$, $\beta_{(i)}$
and $\Gamma_{(i)} := [\Gamma_{1(i)}, \dots, \Gamma_{p-1(i)}]$ contain the values for the i-th subperiod, where
$i = 1, 2$. Using similar arguments as in the proof of Proposition 7.3, it follows
that the ML estimators of these parameter matrices can be determined by two
separate reduced rank regressions applied to each of the two models (see also
Problem 8.5). Notice that the presample values used in the second subsample
coincide with the last observations of the first subperiod. To avoid this overlap,
one may consider starting the second subsample only with observation
$y_{T_1+p+1}$. Such a modification may have advantages in small samples if there
is actually a structural break. If the null hypothesis of constant parameters
in both subperiods is tested, however, there is no strong case for dropping
observations between the two subsamples because, under the null hypothesis,
all observations are generated by the same process.

Assuming,  as  in  Section  4.6.1,  that  both  parts  of  the  sample  go  to  infinity 
at  a  fixed  proportion  when  T  gets  large,  the  asymptotic  theory  of  Section  7.2 
can  be  applied  to  derive  the  asymptotic  distributions  of  the  estimators.  These 
asymptotic  results  can  then  be  used  to  test  parameter  constancy  hypotheses 
of  the  type 

$$H_0: \beta_{(1)} = \beta_{(2)},\ \alpha_{(1)} = \alpha_{(2)},\ \Gamma_{(1)} = \Gamma_{(2)} \qquad (8.4.5)$$

against the alternative that at least one of the equalities is violated. From
the results in Section 7.2, it follows that the relevant Wald or LR tests have
asymptotic $\chi^2$-distributions. To determine the number of degrees of freedom,
it has to be kept in mind, however, that a nonsingular asymptotic distribution
for the estimator of $\beta$ is only obtained upon suitable normalization. Hence,
the equalities $\beta_{(1)} = \beta_{(2)}$ account only for $r(K - r)$ restrictions so that the LR
statistic corresponding to (8.4.5) has a limiting $\chi^2$-distribution with
$r(K - r) + rK + (p - 1)K^2$ degrees of freedom. It is also possible to construct similar
tests  for  constancy  of  only  a  subset  of  the  parameters  (see  Hansen  (2003)). 
Moreover,  the  tests  can  be  extended  to  models  with  deterministic  terms. 
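As a small worked illustration of the degrees of freedom count for the Chow LR test of (8.4.5), the following helper adds up the three components. The function name and the example dimensions are hypothetical and chosen only for illustration.

```python
def chow_lr_df(K, r, p):
    """Degrees of freedom of the Chow LR test for (8.4.5) in a VECM without
    deterministic terms: normalized beta, alpha, and short-run parameters."""
    beta_restrictions = r * (K - r)       # free elements of the normalized beta
    alpha_restrictions = r * K            # loading coefficients
    gamma_restrictions = (p - 1) * K**2   # short-run coefficient matrices
    return beta_restrictions + alpha_restrictions + gamma_restrictions

# Hypothetical bivariate system with rank 1 and three lagged differences (p = 4)
df_chow = chow_lr_df(K=2, r=1, p=4)
```

For this example the count is 1 + 2 + 12 restrictions; models with deterministic terms would require adding the corresponding parameters.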

Prediction  Tests  for  Structural  Change 

In Chapter 4, Section 4.6.2, we have considered two tests for structural change
that may be applied with small modifications if the data generation process
is integrated or cointegrated. To see this, consider a K-dimensional Gaussian
VECM with cointegration rank $r$, as in (8.4.1). Denoting the optimal h-step
forecast at origin $T$ by $y_T(h)$ and its MSE matrix by $\Sigma_y(h)$, as in Section 6.5,
the quantity

$$\tau_h = [y_{T+h} - y_T(h)]'\,\Sigma_y(h)^{-1}\,[y_{T+h} - y_T(h)] \qquad (8.4.6)$$

has a $\chi^2(K)$-distribution (see Section 4.6.2). If the parameters of the process
were known, this statistic could be used to test whether $y_{T+h}$ is generated by
a Gaussian process of the type (8.4.1).

In practice, the process parameters have to be replaced by estimators and,
in Section 4.6.2, we have modified the forecast MSE matrix accordingly. In
Section 7.5, we have seen that the MSE approximation used for stationary,
stable processes is not appropriate in the present integrated case. Therefore,
we propose the statistic

$$\tau_h^{\#} = [y_{T+h} - \hat y_T(h)]'\,\hat\Sigma_y(h)^{-1}\,[y_{T+h} - \hat y_T(h)]/K, \qquad (8.4.7)$$

which has an approximate $F(K, T - Kp - 1)$-distribution. Here

$$\hat y_T(h) = \hat A_1 \hat y_T(h-1) + \cdots + \hat A_p \hat y_T(h-p),$$

with $\hat y_T(j) := y_{T+j}$ for $j \le 0$, and the $\hat A_i$'s are the ML estimators of the $A_i$'s
obtained from ML estimation of the VECM and converting to the levels VAR
representation. Moreover,

$$\hat\Sigma_y(h) = \sum_{i=0}^{h-1} \hat\Phi_i \tilde\Sigma_u \hat\Phi_i',$$

where $\tilde\Sigma_u$ is the ML estimator of $\Sigma_u$ (see Proposition 7.3) and the $\hat\Phi_i$'s are
computed from the $\hat A_i$'s by the recursions in (6.5.5). The F approximation to
the distribution of $\tau_h^{\#}$ follows by noting that

$$\mathrm{plim}(\tau_h - K\tau_h^{\#}) = 0.$$

Hence $K\tau_h^{\#}$ has an asymptotic $\chi^2(K)$-distribution and

$$\tau_h^{\#} \approx \chi^2(K)/K \approx F(K, T - Kp - 1), \qquad (8.4.8)$$

where the denominator degrees of freedom are chosen in analogy with the
stationary case. The quality of the F approximation in small samples is presently
unknown.
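The computations behind the single-forecast statistic can be sketched as follows. The code assumes a VECM without deterministic terms as in (8.4.1), uses the standard conversion to the levels VAR ($A_1 = I_K + \alpha\beta' + \Gamma_1$, $A_i = \Gamma_i - \Gamma_{i-1}$, $A_p = -\Gamma_{p-1}$), and all function names as well as the numerical parameter values are hypothetical; estimates would take the place of these values in practice.

```python
import numpy as np

def vecm_to_levels_var(alpha, beta, gammas):
    """Convert VECM parameters to levels VAR coefficient matrices A_1, ..., A_p."""
    K = alpha.shape[0]
    pi = alpha @ beta.T
    if not gammas:                       # p = 1: no lagged differences
        return [np.eye(K) + pi]
    A = [np.eye(K) + pi + gammas[0]]
    for i in range(1, len(gammas)):
        A.append(gammas[i] - gammas[i - 1])
    A.append(-gammas[-1])
    return A

def ma_matrices(A, h):
    """Phi_0, ..., Phi_{h-1} from Phi_i = sum_j Phi_{i-j} A_j, Phi_0 = I."""
    phi = [np.eye(A[0].shape[0])]
    for i in range(1, h):
        phi.append(sum(phi[i - j] @ A[j - 1] for j in range(1, min(i, len(A)) + 1)))
    return phi

def forecast(A, y_hist, h):
    """h-step forecast from the last p rows of y_hist (levels VAR recursion)."""
    p = len(A)
    path = [row for row in y_hist[-p:]]
    for _ in range(h):
        path.append(sum(A[j] @ path[-1 - j] for j in range(p)))
    return path[-1]

def prediction_test_stat(A, sigma_u, y_hist, y_future, h):
    """Quadratic form of the h-step forecast error divided by K, as in (8.4.7)."""
    mse = sum(f @ sigma_u @ f.T for f in ma_matrices(A, h))
    err = y_future - forecast(A, y_hist, h)
    return err @ np.linalg.solve(mse, err) / sigma_u.shape[0]

# Hypothetical bivariate VECM with r = 1 and one lagged difference (p = 2)
alpha = np.array([[-0.2], [0.1]])
beta = np.array([[1.0], [-1.0]])
gammas = [np.array([[0.3, 0.0], [0.1, 0.2]])]
A = vecm_to_levels_var(alpha, beta, gammas)
```

The resulting statistic would then be compared with a quantile of the $F(K, T - Kp - 1)$-distribution, which is not computed in this sketch.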

A  test  based  on  several  forecasts,  as  discussed  in  Section  4.6.2,  may  be 
generalized  to  integrated  processes  in  a  similar  way.  We  may  use 

$$\lambda_h = T \sum_{i=1}^{h} \hat u_{T+i}' \hat\Sigma_u^{-1} \hat u_{T+i} \big/ [(T + Kp + 1)Kh] \qquad (8.4.9)$$

as a test statistic with an approximate $F(Kh, T - Kp - 1)$-distribution. Here the $\hat u_{T+i}$'s are the residuals obtained for the postsample period by using the
ML  estimators.  The  approximate  distribution  follows  from  asymptotic  theory 
as  in  the  stationary,  stable  case  (see  Problem  8.7). 
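The statistic in (8.4.9) is a scaled sum of quadratic forms in the postsample residuals. A minimal sketch, assuming the residuals and the estimated white noise covariance matrix are already available (the function name and the numbers are hypothetical):

```python
import numpy as np

def lambda_stat(resid, Sigma_u_hat, T, p):
    """T * sum_i u_i' Sigma_u^{-1} u_i / [(T + K p + 1) K h], cf. (8.4.9).

    resid is an (h x K) array whose rows are the postsample residuals.
    """
    h, K = resid.shape
    quad = sum(float(u @ np.linalg.solve(Sigma_u_hat, u)) for u in resid)
    return T * quad / ((T + K * p + 1) * K * h)

# Illustrative numbers only: two postsample periods, K = 2, p = 2, T = 100
resid = np.array([[1.0, 0.0], [0.0, 2.0]])
lam = lambda_stat(resid, np.eye(2), T=100, p=2)
```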

A  number  of  other  tests  for  structural  change  are  available  for  VECMs. 
For  instance,  Hansen  &  Johansen  (1999)  proposed  tests  which  are  based  on 
the  eigenvalues  from  the  ML  estimation  procedure  (see  Proposition  7.3). 


8.5  Exercises 

8.5.1  Algebraic  Exercises 

Problem  8.1 

Consider the model $y_t = \mu_t + x_t$, as in Section 8.2, for quarterly series with deterministic term

$$\mu_t = \mu_0 + \sum_{i=1}^{3} \delta_i s_{it},$$

where $\mu_0$ and $\delta_i$ ($i = 1, 2, 3$) are $(K \times 1)$ parameter vectors and the seasonal dummies are denoted by $s_{it}$, that is, $s_{it}$ has a value of 1 in season $i$ and $-1/3$ otherwise. Show that the VECM for $y_t$ can be written as

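$$\Delta y_t = \nu_0^* + \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + \sum_{i=1}^{3} \delta_i^* s_{it} + u_t.$$

Show also that the vector $(s_{1t}, s_{2t}, s_{3t}, s_{4t})'$ is orthogonal to $(1, 1, 1, 1)'$. In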
other  words,  the  seasonal  dummies  are  orthogonal  to  the  constant  term. 

Problem  8.2 

Use  Proposition  C.18  from  Appendix  C.8.2  to  construct  a  mechanism  for 
approximating  the  distribution 

$$\left(\int_0^1 W^{o}\,dW'\right)' \left(\int_0^1 W^{o}W^{o\prime}\,ds\right)^{-1} \left(\int_0^1 W^{o}\,dW'\right)$$

in  (8.2.12)  via  simulation. 

Problem  8.3 

Write down the EGLS estimation problem for the cointegration rank tests described in Section 8.2.7 for a model with $\mu_t = \mu_0$.

Problem  8.4 

Consider  residual  autocorrelation  tests  for  a  three-dimensional  VECM  with 
two  lagged  differences  (p  =  3)  and  a  cointegrating  rank  of  r  =  2.  What  are  the 
approximate distributions of the $Q_{20}$, $Q_{25}$, and $Q_{30}$ portmanteau statistics?
What are the asymptotic distributions of $\lambda_{LM}(2)$ and $\lambda_{LM}(5)$?

Problem 8.5

Show that, for a sample $y_1, \dots, y_T$ with a possible structural break in period $T_1$, $1 < T_1 < T$, a VECM can be estimated by two separate reduced rank
regressions  as  in  Proposition  7.3.  (Hint:  Use  similar  arguments  as  in  the  proof 
of  Proposition  7.3.) 

Problem  8.6 
Consider  the  model 

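$$[\Delta Y_{(1)} : \Delta Y_{(2)}] = \alpha\beta' Y_{-1} + [\Gamma_{(1)} : \Gamma_{(2)}] \begin{bmatrix} \Delta X_{(1)} & 0 \\ 0 & \Delta X_{(2)} \end{bmatrix} + [U_{(1)} : U_{(2)}],$$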

where  the  symbols  are  defined  as  in  (8.4.3)  and  (8.4.4).  Derive  the  ML  es¬ 
timators  of  the  parameters.  (Hint:  Use  similar  arguments  as  in  the  proof  of 
Proposition  7.3.) 

Problem  8.7 

Under  the  conditions  of  Section  8.4.3,  show  that 
$$(T + Kp + 1)Kh\,\lambda_h/T \xrightarrow{d} \chi^2(hK),$$
where $\lambda_h$ is the statistic defined in (8.4.9).

8.5.2  Numerical  Exercises 

The  following  problems  are  based  on  the  U.S.  data  given  in  File  E3  and 
described  in  Section  7.4.3.  The  variables  are  defined  in  the  same  way  as  in 
that  section. 

Problem  8.8 

Use  a  maximum  order  of  10  and  determine  the  VAR  order  of  the  example 
system  by  the  three  model  selection  criteria  AIC,  HQ,  and  SC. 

Problem  8.9 

Assume  that  the  data  are  generated  by  a  VAR(3)  process  and  determine  the 
cointegration  rank  with  the  tests  described  in  Section  8.2. 

Problem  8.10 

Modify  the  AIC  criterion  appropriately  and  choose  the  order  and  cointegration 
rank  simultaneously  with  this  criterion.  Compare  the  result  with  that  from 
Problem  8.9. 

Problem  8.11 

Apply the ML procedure described in Section 7.2.3 and the EGLS estimator of Section 7.2.2 to estimate the cointegration relation and the other parameters of a VECM with cointegration rank $r = 1$, two lagged differences (i.e., $p = 3$) and an intercept. Compare the estimates.

Problem  8.12 

Use diagnostic tests to check the adequacy of the model estimated in Problem 8.11.
In Parts I and II, we have assumed that the time series of interest are generated by stationary or cointegrated reduced form VAR processes. In this part,
structural  models  and  systems  with  unmodelled,  exogenous  variables  are  dis¬ 
cussed.  In  Chapter  9,  structural  VARs  and  VECMs  are  considered  and,  in 
Chapter  10,  conditional  or  partial  models  are  treated,  where  we  condition 
on  some  variables  whose  generation  process  is  not  part  of  the  model.  These 
systems  may  be  stationary  if  the  unmodelled  variables  are  generated  by  sta¬ 
tionary  processes.  Alternatively,  some  or  all  of  the  unmodelled  variables  may 
be  nonstochastic  fixed  quantities.  In  that  case,  the  mean  vectors  of  the  time 
series  variables  of  interest  may  be  time  varying  and,  hence,  the  series  may 
not  be  stationary.  They  may  still  be  stationary  when  the  deterministic  terms 
are  removed,  however.  Generally,  some  of  the  endogenous  and  unmodelled 
stochastic  variables  may  be  integrated  and  have  stochastic  trends.  Suitable 
models  for  this  case  will  also  be  considered  in  Chapter  10. 


9 


Structural  VARs  and  VECMs 


In  Chapters  2  and  6,  we  have  seen  that,  on  the  one  hand,  impulse  responses 
are  an  important  tool  to  uncover  the  relations  between  the  variables  in  a 
VAR or VECM and, on the other hand, there are some obstacles in their interpretation. In particular, impulse responses are generally not unique and it is often not clear which set of impulse responses actually reflects what is going on in a given system. Because the different sets of impulses can be computed
from  the  same  underlying  VAR  or  VECM,  it  is  clear  that  nonsample  informa¬ 
tion  has  to  be  used  to  decide  on  the  proper  set  for  a  particular  given  model. 
In  econometric  terminology,  VARs  are  reduced  form  models  and  structural 
restrictions  are  required  to  identify  the  relevant  innovations  and  impulse  re¬ 
sponses.  In  this  chapter,  different  possible  restrictions  that  have  been  proposed 
in the literature will be considered. The resulting models are known as structural VAR (SVAR) models (see, e.g., Sims (1981, 1986), Bernanke (1986), Shapiro & Watson (1988), Blanchard & Quah (1989)) or structural VECMs (SVECMs) (e.g., King, Plosser, Stock & Watson (1991), Jacobson, Vredin & Warne (1997), Gonzalo & Ng (2001), Breitung, Brüggemann & Lütkepohl (2004)).

In  the  next  section,  structural  restrictions  will  be  discussed  for  stationary 
processes.  Some  of  them  will  also  be  relevant  for  VARs  with  integrated  vari¬ 
ables.  Such  variables  are  explicitly  taken  into  account  in  VECMs  for  which 
structural  restrictions  will  be  discussed  in  Section  9.2.  It  will  be  seen  that 
VECMs  offer  additional  possibilities  for  structural  restrictions.  The  general 
modelling  strategy  for  both  SVARs  and  SVECMs  is  to  specify  and  estimate 
a  reduced  form  model  first  and  then  focus  on  the  structural  parameters  and 
the  resulting  structural  impulse  responses.  Estimation  of  structural  VARs  and 
VECMs  will  be  discussed  in  Section  9.3  and  impulse  response  analysis  and 
forecast  error  variance  decomposition  based  on  such  models  are  considered  in 
Section  9.4.  Some  extensions  of  the  setup  used  in  this  chapter  are  pointed  out 
in  Section  9.5. 
9.1  Structural Vector Autoregressions

Our point of departure is a $K$-dimensional stationary, stable VAR($p$) process,

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \qquad (9.1.1)$$

where, as usual, $y_t$ is a $(K \times 1)$ vector of observable time series variables, the $A_j$'s ($j = 1, \dots, p$) are $(K \times K)$ coefficient matrices and $u_t$ is $K$-dimensional white noise with $u_t \sim (0, \Sigma_u)$. Deterministic terms have been excluded for simplicity. In other words, we just consider the stochastic part of a data generation process because it is the part of interest from the point of view of structural modelling and impulse response analysis. From Chapter 2, it is known that the process (9.1.1) has a Wold MA representation

$$y_t = u_t + \Phi_1 u_{t-1} + \Phi_2 u_{t-2} + \cdots, \qquad (9.1.2)$$

where

$$\Phi_s = \sum_{j=1}^{s} \Phi_{s-j} A_j, \quad s = 1, 2, \dots, \qquad (9.1.3)$$

with $\Phi_0 = I_K$.
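The recursion in (9.1.3) is straightforward to program. The following sketch uses the convention $A_j = 0$ for $j > p$; the function name and the coefficient values are hypothetical and serve only as an illustration.

```python
import numpy as np

def ma_coefficients(A_list, n):
    """Return [Phi_0, ..., Phi_n] with Phi_s = sum_{j=1}^{s} Phi_{s-j} A_j.

    A_list contains A_1, ..., A_p; terms with j > p are dropped (A_j = 0).
    """
    K, p = A_list[0].shape[0], len(A_list)
    Phi = [np.eye(K)]                      # Phi_0 = I_K
    for s in range(1, n + 1):
        Phi_s = np.zeros((K, K))
        for j in range(1, min(s, p) + 1):
            Phi_s += Phi[s - j] @ A_list[j - 1]
        Phi.append(Phi_s)
    return Phi

# Illustrative bivariate VAR(2) coefficients
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
A2 = np.array([[0.0, 0.0], [0.25, 0.0]])
Phi = ma_coefficients([A1, A2], 3)
```

For a VAR(1) the recursion reduces to $\Phi_s = A_1^s$, which gives a quick sanity check of any implementation.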

In Chapter 2, we have also seen that the elements of the $\Phi_i$ matrices are the forecast error impulse responses. They may not reflect the relations between the variables properly because the components of $u_t$ may be instantaneously correlated, that is, $\Sigma_u$ may not be a diagonal matrix. Thus, isolated shocks in the components of $u_t$ may not be likely in practice. From Chapter 2, we also know that there are different ways to orthogonalize the impulses. One possibility is based on a Choleski decomposition of the white noise covariance matrix, $\Sigma_u = PP'$, where $P$ is a lower-triangular matrix with positive elements on
the  main  diagonal.  Again  such  an  approach  is  arbitrary  and  therefore  unsat¬ 
isfactory,  unless  there  are  special  reasons  for  a  recursive  structure.  We  will 
now  discuss  different  ways  to  use  nonsample  information  in  specifying  unique 
innovations  and,  hence,  unique  impulse  responses.  The  relevant  models  will 
be  referred  to  as  A-model,  B-model  and  AB-model.  The  latter  label  was  also 
used  by  Amisano  &  Giannini  (1997).  The  models  will  be  considered  in  turn 
in  the  following. 

9.1.1  The  A-Model 

A  conventional  approach  to  finding  a  model  with  instantaneously  uncorre¬ 
lated  residuals  is  to  model  the  instantaneous  relations  between  the  observable 
variables  directly.  That  may  be  done  by  considering  a  structural  form  model, 


$$A y_t = A_1^* y_{t-1} + \cdots + A_p^* y_{t-p} + \varepsilon_t, \qquad (9.1.4)$$

where $A_j^* := A A_j$ ($j = 1, \dots, p$) and $\varepsilon_t := A u_t \sim (0, \Sigma_\varepsilon = A \Sigma_u A')$. Thus, for a proper choice of $A$, $\varepsilon_t$ will have a diagonal covariance matrix. An MA representation based on the $\varepsilon_t$ is given by

$$y_t = \Theta_0 \varepsilon_t + \Theta_1 \varepsilon_{t-1} + \Theta_2 \varepsilon_{t-2} + \cdots, \qquad (9.1.5)$$

where $\Theta_j = \Phi_j A^{-1}$ ($j = 0, 1, 2, \dots$). The elements of the $\Theta_j$ matrices represent the responses to $\varepsilon_t$ shocks. If an identified structural form (9.1.4) can be found, the corresponding impulse responses will be unique.

It may be worth reflecting a little on the restrictions required for a unique matrix $A$ of instantaneous effects. From the relation

$$\Sigma_\varepsilon = A \Sigma_u A'$$

and the assumption of a diagonal $\Sigma_\varepsilon$ matrix, we get $K(K-1)/2$ independent equations, that is, all $K(K-1)/2$ off-diagonal elements of $A \Sigma_u A'$ are equal to zero. To solve uniquely for all $K^2$ elements of $A$, we need a set of $K^2$ equations, however. In other words, we need $K(K+1)/2$ additional equations. They may be set up in the form of restrictions for the elements of $A$. Clearly, we may want to choose the diagonal elements of $A$ to be unity. This normalization enables us to write the $k$-th equation of (9.1.4) with $y_{kt}$ as the left-hand variable. In addition to this normalization, we still need another $K(K-1)/2$ restrictions. Such restrictions have to come from nonsample sources. For example, if a Wold causal ordering is possible, where $y_{1t}$ may have an instantaneous impact on all the other variables, $y_{2t}$ may have an instantaneous impact on all other variables except $y_{1t}$, and so on (see Section 2.3.2), then

$$A = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ a_{21} & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{K1} & a_{K2} & \cdots & 1 \end{bmatrix}$$

is a lower-triangular matrix. Thus, we have just enough restrictions ($K(K-1)/2$ zeros above the main diagonal) so that the innovations and the associated impulse responses are just-identified. The zeros can also appear in a different arrangement as off-diagonal elements of $A$. There can also be more than $K(K-1)/2$ restrictions, of course. In SVAR modelling it is common, however, that just-identified models are considered. In other words, only as few restrictions are imposed as are necessary for obtaining unique impulse responses. If at some stage of the analysis it turns out that further restrictions are compatible with the data, it is also possible to impose them, of course.

In the presently considered model, the identifying restrictions are imposed on the matrix $A$ such that $\varepsilon_t = A u_t$ has a diagonal covariance matrix. This model will be called the A-model in the following. Given the way we have introduced the associated restrictions, it is plausible to assume that $A$ has a unit main diagonal. In that case $K(K-1)/2$ restrictions are required for the off-diagonal elements of $A$ to ensure just-identified shocks $\varepsilon_t$ and, hence, just-identified impulse responses. If the restrictions are such that $A$ is lower-triangular, the same is true for $A^{-1}$. Thus, the resulting $\Theta_j$ impulse responses are qualitatively the same as the orthogonalized impulse responses based on a Choleski decomposition of $\Sigma_u$ which were considered in Chapter 2. The only difference is that, for the latter case, the $w_t$ impulses have unit variances which may not be the case for the presently considered $\varepsilon_t$ impulses.
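The relation between a recursive A-model and the Choleski-based responses can be made concrete with a small numerical sketch. The matrices below are hypothetical, and the construction $A = DP^{-1}$, with $D$ the diagonal part of the Choleski factor $P$, is one convenient way to obtain a unit-diagonal lower-triangular $A$ with diagonal $A\Sigma_uA'$; it is meant as an illustration only.

```python
import numpy as np

Sigma_u = np.array([[4.0, 2.0], [2.0, 5.0]])   # illustrative covariance matrix
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])        # illustrative VAR(1) coefficients

P = np.linalg.cholesky(Sigma_u)                # lower triangular, Sigma_u = P P'
D = np.diag(np.diag(P))                        # diagonal matrix of P's diagonal
A = D @ np.linalg.inv(P)                       # lower triangular with unit diagonal
Sigma_eps = A @ Sigma_u @ A.T                  # equals D^2, hence diagonal

# For a VAR(1), Phi_j = A1^j
Phi = [np.linalg.matrix_power(A1, j) for j in range(4)]
Theta = [Ph @ np.linalg.inv(A) for Ph in Phi]  # A-model responses, Theta_j = Phi_j A^{-1}
Theta_chol = [Ph @ P for Ph in Phi]            # orthogonalized responses of Chapter 2
```

In this sketch `Theta[j] @ D` coincides with `Theta_chol[j]`, so the two sets of responses differ only by the scaling of the shocks.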

Regarding the restrictions for $A$, it should be understood that they cannot be arbitrary restrictions. Writing them in the form $C_A \operatorname{vec}(A) = c_A$, where $C_A$ is a $(\frac{1}{2}K(K+1) \times K^2)$ selection matrix and $c_A$ is a suitable $(\frac{1}{2}K(K+1) \times 1)$ fixed vector, the restrictions have to be such that the system of equations

$$A^{-1} \Sigma_\varepsilon A'^{-1} = \Sigma_u \quad \text{and} \quad C_A \operatorname{vec}(A) = c_A \qquad (9.1.6)$$


has  a  unique  solution,  at  least  locally.  Clearly,  this  system  is  nonlinear  in  A. 
Therefore,  we  can  only  hope  for  local  uniqueness  or  identification  in  general. 
The  following  proposition  gives  a  necessary  and  sufficient  condition  for  (9.1.6) 
to  have  a  locally  unique  solution  and,  thus,  for  local  identification  of  the 
structural  parameters. 

Proposition 9.1 (Identification of the A-Model)

Let $\Sigma_\varepsilon$ be a $(K \times K)$ positive definite diagonal matrix and let $A$ be a $(K \times K)$ nonsingular matrix. Then, for a given symmetric, positive definite $(K \times K)$ matrix $\Sigma_u$, an $(N \times K^2)$ matrix $C_A$ and a fixed $(N \times 1)$ vector $c_A$, the system of equations in (9.1.6) has a locally unique solution for $A$ and the diagonal elements of $\Sigma_\varepsilon$ if and only if

$$\operatorname{rk} \begin{bmatrix} -2D_K^+(\Sigma_u \otimes A^{-1}) & D_K^+(A^{-1} \otimes A^{-1})D_K \\ C_A & 0 \\ 0 & C_\sigma \end{bmatrix} = K^2 + \tfrac{1}{2}K(K+1).$$

Here $D_K$ is a $(K^2 \times \frac{1}{2}K(K+1))$ duplication matrix, $D_K^+ := (D_K'D_K)^{-1}D_K'$, and $C_\sigma$ is a $(\frac{1}{2}K(K-1) \times \frac{1}{2}K(K+1))$ selection matrix which selects the elements of $\operatorname{vech}(\Sigma_\varepsilon)$ below the main diagonal. ■

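Proof: For an $n$-dimensional function $\varphi(x)$ of the $m$-dimensional vector $x$, the system of equations $\varphi(x) = 0$ can be solved locally uniquely for $x$ in a neighborhood of a given vector $x_0$ if and only if $\operatorname{rk}(\partial\varphi/\partial x'|_{x=x_0}) = m$ (see, e.g., Rothenberg (1971, Theorem 6)). Hence, considering the function

$$\varphi\begin{pmatrix} \operatorname{vec}(A) \\ \operatorname{vech}(\Sigma_\varepsilon) \end{pmatrix} = \begin{bmatrix} \operatorname{vech}(A^{-1}\Sigma_\varepsilon A'^{-1} - \Sigma_u) \\ C_A \operatorname{vec}(A) - c_A \\ C_\sigma \operatorname{vech}(\Sigma_\varepsilon) \end{bmatrix},$$

a locally unique solution for $A$ and $\operatorname{vech}(\Sigma_\varepsilon)$ exists for a given $\Sigma_u$ if and only if

$$\operatorname{rk} \begin{bmatrix} \dfrac{\partial \operatorname{vech}(A^{-1}\Sigma_\varepsilon A'^{-1})}{\partial \operatorname{vec}(A)'} & \dfrac{\partial \operatorname{vech}(A^{-1}\Sigma_\varepsilon A'^{-1})}{\partial \operatorname{vech}(\Sigma_\varepsilon)'} \\ C_A & 0 \\ 0 & C_\sigma \end{bmatrix} = K^2 + \tfrac{1}{2}K(K+1).$$

Taking into account that the off-diagonal elements of $\Sigma_\varepsilon$ are uniquely determined by $C_\sigma \operatorname{vech}(\Sigma_\varepsilon) = 0$, a locally unique solution for $A$ and the diagonal elements of $\Sigma_\varepsilon$ exists if and only if the rank condition is satisfied. Thus, the proposition follows by using the rules for matrix and vector differentiation from Appendix A.13 and noting that

$$\begin{aligned} \frac{\partial \operatorname{vech}(A^{-1}\Sigma_\varepsilon A'^{-1})}{\partial \operatorname{vec}(A)'} &= D_K^+ \frac{\partial \operatorname{vec}(A^{-1}\Sigma_\varepsilon A'^{-1})}{\partial \operatorname{vec}(A^{-1})'} \frac{\partial \operatorname{vec}(A^{-1})}{\partial \operatorname{vec}(A)'} \\ &= -D_K^+(I_{K^2} + K_{KK})(A^{-1}\Sigma_\varepsilon \otimes I_K)(A'^{-1} \otimes A^{-1}) \\ &= -2D_K^+(\Sigma_u \otimes A^{-1}), \end{aligned}$$

where $K_{KK}$ denotes a $(K^2 \times K^2)$ commutation matrix and the last equality sign holds because $D_K^+K_{KK} = D_K^+$ (see Appendix A.12.2). ■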


Although  this  proposition  provides  a  condition  for  local  identification  of 
the  A-model  only,  a  globally  unique  solution  is  obtained  if  the  diagonal  el¬ 
ements  of  A  are  restricted  to  1.  A  discussion  of  the  nonuniqueness  problem 
resulting  from  sign  changes  of  some  elements  will  be  deferred  to  Section  9.1.2. 

For  practical  purposes,  it  is  problematic  that  the  identification  condition 
in  Proposition  9.1  involves  unknown  parameters.  Therefore,  strictly  speaking, 
it  can  only  be  checked  when  the  true  parameters  are  known.  In  practice,  the 
unknown  quantities  may  be  replaced  by  estimates  and  the  condition  may  be 
checked using the estimated matrix because it can be shown that the rank of the relevant matrix is either smaller than $K^2 + \frac{1}{2}K(K+1)$ everywhere in the parameter space or the rank condition is satisfied almost everywhere. In the latter case, it can fail only on a set of Lebesgue measure zero. Thus, if a randomly drawn vector from the parameter space is considered, it should satisfy the rank condition with probability one, if the model is locally identified. In any case, $C_A$ must have at least $K(K+1)/2$ rows to ensure identification.
In  other  words,  having  K(K  +  l)/2  restrictions  is  a  necessary  condition  for 
identification. 
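A numerical check of the rank condition in Proposition 9.1 might look as follows. This is a sketch under assumptions: the helper names are hypothetical, `vec` and `vech` are taken column by column, and the condition is evaluated at a single illustrative point of a bivariate recursive A-model.

```python
import numpy as np

def duplication_matrix(K):
    """D_K with vec(S) = D_K vech(S) for symmetric S (column-major stacking)."""
    pos = {}
    for j in range(K):
        for i in range(j, K):
            pos[(i, j)] = len(pos)             # position of s_ij in vech(S)
    D = np.zeros((K * K, len(pos)))
    for j in range(K):
        for i in range(K):
            D[j * K + i, pos[(max(i, j), min(i, j))]] = 1.0
    return D, pos

def a_model_rank(A, Sigma_eps, C_A):
    """Rank of the matrix in Proposition 9.1 at the point (A, Sigma_eps)."""
    K = A.shape[0]
    D, pos = duplication_matrix(K)
    D_plus = np.linalg.solve(D.T @ D, D.T)     # (D'D)^{-1} D'
    A_inv = np.linalg.inv(A)
    Sigma_u = A_inv @ Sigma_eps @ A_inv.T
    # C_sigma selects the below-diagonal elements of vech(Sigma_eps)
    C_sigma = np.eye(len(pos))[[c for (i, j), c in pos.items() if i > j]]
    top = np.hstack([-2 * D_plus @ np.kron(Sigma_u, A_inv),
                     D_plus @ np.kron(A_inv, A_inv) @ D])
    mid = np.hstack([C_A, np.zeros((C_A.shape[0], D.shape[1]))])
    bot = np.hstack([np.zeros((C_sigma.shape[0], K * K)), C_sigma])
    return np.linalg.matrix_rank(np.vstack([top, mid, bot]))

# K = 2, vec(A) = (a11, a21, a12, a22)': restrict a11 = 1, a12 = 0, a22 = 1
A = np.array([[1.0, 0.0], [-0.5, 1.0]])
Sigma_eps = np.diag([4.0, 4.0])
rank_identified = a_model_rank(A, Sigma_eps, np.eye(4)[[0, 2, 3]])
rank_too_few = a_model_rank(A, Sigma_eps, np.eye(4)[[0, 3]])
```

With the three restrictions the rank equals $K^2 + \frac{1}{2}K(K+1) = 7$ in this example, whereas dropping the zero restriction leaves too few rows for the condition to hold.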

Although we have stated the restrictions for the $A$ matrix in the form $C_A \operatorname{vec}(A) = c_A$ in the foregoing, we note that they can be written alternatively in the form

$$\operatorname{vec}(A) = R_A \gamma_A + r_A,$$

where $R_A$ and $r_A$ are a suitable fixed matrix and a suitable vector, respectively, and $\gamma_A$ is the vector of unrestricted parameters (see Chapter 5, Section 5.2.1).


9.1.2  The  B-Model 

Generally, in impulse response analysis the emphasis has shifted from specifying the relations between the observable variables directly to interpreting the unexpected part of their changes or the shocks. Therefore, it is not uncommon to identify the structural innovations $\varepsilon_t$ directly from the forecast errors or reduced form residuals $u_t$. One way to do so is to think of the forecast errors as linear functions of the structural innovations. In that case, we have the relations $u_t = B\varepsilon_t$. Hence, $\Sigma_u = B\Sigma_\varepsilon B'$. Normalizing the variances of the structural innovations to one, i.e., assuming $\varepsilon_t \sim (0, I_K)$, gives

$$\Sigma_u = BB'. \qquad (9.1.7)$$

Due to the symmetry of the covariance matrix, these relations specify only $K(K+1)/2$ different equations and we need again $K(K-1)/2$ further relations to identify all $K^2$ elements of $B$. As in the previous A-model case, choosing $B$ to be lower-triangular, for example, provides sufficiently many restrictions. Hence, choosing $B$ by a Choleski decomposition solves the identification or uniqueness problem, as we have also seen in Chapter 2, Section 2.3.2. Now it is assumed, however, that this recursive structure is chosen only if it has some theoretical justification so that the $\varepsilon_t$'s can be regarded as structural innovations. This property makes them potentially different from the $w_t$ innovations in Chapter 2 which were obtained by a mechanical application of the Choleski decomposition. In principle, there could be other zero restrictions for $B$ in the present context. The triangular form is just an example. In practice, it is perhaps the most important case (e.g., Eichenbaum & Evans (1995), Christiano, Eichenbaum & Evans (1996)).

The present model with

$$u_t = B\varepsilon_t$$

and $\varepsilon_t \sim (0, I_K)$ will be called B-model in the following and it is worth remembering that at least $K(K-1)/2$ restrictions have to be imposed to identify $B$. If there are just zero restrictions they can be written in the form

$$C_B \operatorname{vec}(B) = 0, \qquad (9.1.8)$$

where $C_B$ is an $(N \times K^2)$ selection matrix. A necessary and sufficient rank condition for local identification of the model is given in the next proposition.

Proposition 9.2 (Local Identification of the B-Model)

Let $B$ be a nonsingular $(K \times K)$ matrix. Then, for a given symmetric, positive definite $(K \times K)$ matrix $\Sigma_u$ and an $(N \times K^2)$ matrix $C_B$, the system of equations in (9.1.7)/(9.1.8) has a locally unique solution if and only if

$$\operatorname{rk} \begin{bmatrix} 2D_K^+(B \otimes I_K) \\ C_B \end{bmatrix} = K^2.$$

Proof: Using the same kind of reasoning as in the proof of Proposition 9.1, the result of Proposition 9.2 follows by noting that

$$\frac{\partial \operatorname{vech}(BB')}{\partial \operatorname{vec}(B)'} = D_K^+(I_{K^2} + K_{KK})(B \otimes I_K) = 2D_K^+(B \otimes I_K).$$

A necessary condition for the $((\frac{1}{2}K(K+1) + N) \times K^2)$ matrix

$$\begin{bmatrix} 2D_K^+(B \otimes I_K) \\ C_B \end{bmatrix}$$

to have rank $K^2$ is that $N \ge \frac{1}{2}K(K-1)$. In other words, we need at least $\frac{1}{2}K(K-1)$ restrictions for identification, as mentioned earlier.
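The rank condition of Proposition 9.2 can also be examined numerically without coding the duplication matrix. In the sketch below, which is an assumption-laden illustration rather than the book's procedure, the block $2D_K^+(B \otimes I_K)$ is replaced by a central finite-difference Jacobian of $\operatorname{vech}(BB')$ with respect to $\operatorname{vec}(B)$; since the function is quadratic, central differences are exact up to rounding. The function names and the matrix $B$ are hypothetical.

```python
import numpy as np

def vech(M):
    """Lower-triangular elements of M (row ordering is irrelevant for ranks)."""
    return M[np.tril_indices(M.shape[0])]

def b_model_rank(B, zero_positions, step=1e-6):
    """Rank of [d vech(BB')/d vec(B)' ; C_B], Jacobian by central differences.

    zero_positions lists the (column-major, 0-based) indices of vec(B)
    that are restricted to zero.
    """
    K = B.shape[0]
    b0 = B.reshape(-1, order="F")

    def g(b):
        M = b.reshape(K, K, order="F")
        return vech(M @ M.T)

    J = np.column_stack([(g(b0 + step * e) - g(b0 - step * e)) / (2 * step)
                         for e in np.eye(K * K)])
    C_B = np.eye(K * K)[zero_positions]
    return np.linalg.matrix_rank(np.vstack([J, C_B]), tol=1e-8)

# Lower-triangular B for K = 3: b12, b13, b23 are zero (vec positions 3, 6, 7)
B = np.array([[1.0, 0.0, 0.0],
              [0.5, 2.0, 0.0],
              [-0.3, 0.4, 1.5]])
rank_triangular = b_model_rank(B, [3, 6, 7])
rank_two_zeros = b_model_rank(B, [3, 6])
```

With the three zero restrictions the stacked matrix has full column rank $K^2 = 9$; with only two restrictions it has just eight rows, so the necessary condition already fails.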

It is easy to see that the solution of the system (9.1.7)/(9.1.8) will not be globally unique because for any matrix $B$ satisfying the equations, $-B$ will also be a solution. This result is due to the fact that $B$ enters the equations (9.1.7) in "squared" form. In fact, for any solution $B$, the matrix $B\Lambda$ will also be a solution for any diagonal matrix $\Lambda$ which has only 1 and $-1$ elements on the main diagonal. Obviously, if $B$ is such that (9.1.7) and (9.1.8) hold, $\Sigma_u = B\Lambda\Lambda'B'$ also holds because $\Lambda\Lambda' = I_K$. Moreover,

$$C_B \operatorname{vec}(B\Lambda) = C_B(\Lambda \otimes I_K)\operatorname{vec}(B) = 0,$$

because for each element $b_{ij} = 0$ we have $-b_{ij} = 0$. Thus, each column of $B$ can be replaced by a column with opposite sign. Hence, the restrictions in (9.1.8) identify $B$ only locally in general. Uniqueness can potentially be obtained by fixing the signs of the diagonal elements, however. The signs of the diagonal elements of $B$ determine the signs of shocks. Thus, if we want to study the effect of a positive shock to a particular variable while the corresponding diagonal element of $B$ is negative, we can just reverse the signs of all elements in the relevant column of $B$ or, in other words, we can just reverse the signs of all instantaneous responses to the corresponding shock to find the desired result.
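The sign indeterminacy and its resolution by a sign normalization can be illustrated as follows. The matrix $B$ and the helper `normalize_signs` are hypothetical and only serve to make the argument concrete.

```python
import numpy as np

B = np.array([[2.0, 0.0, 0.0],
              [0.5, -1.0, 0.0],
              [-0.3, 0.4, 1.5]])
Lam = np.diag([1.0, -1.0, 1.0])       # flips the sign of the second column
B_flipped = B @ Lam                   # satisfies the same equations as B

def normalize_signs(B):
    """Flip columns so that all diagonal elements of B are non-negative."""
    signs = np.sign(np.diag(B))
    signs[signs == 0] = 1.0
    return B * signs                  # broadcasting multiplies column j by signs[j]

B_pos = normalize_signs(B)
```

Both `B` and `B_flipped` reproduce the same $\Sigma_u = BB'$ and share the same zero pattern; the normalized matrix `B_pos` picks the representative with positive diagonal elements.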

For  later  purposes,  it  is  also  worth  noting  that  the  restrictions  can  be 
expressed  in  the  alternative  form 

$$\operatorname{vec}(B) = R_B \gamma_B, \qquad (9.1.9)$$

where $\gamma_B$ contains all the unrestricted coefficients of $B$ and $R_B$ is a fixed matrix of zeros and ones (see Section 5.2.1).

9.1.3  The  AB-Model 

It is also possible to consider both types of restrictions of the previous subsections simultaneously. That is, we may consider the so-called AB-model,

$$A u_t = B \varepsilon_t, \quad \varepsilon_t \sim (0, I_K). \qquad (9.1.10)$$

In this case, a simultaneous equations system is formulated for the errors of the reduced form model rather than the observable variables directly. Thereby the model accounts for the shift from specifying direct relations for the observable variables to formulating relations for the innovations. Applications of this methodology can, for instance, be found in Galí (1992) and Pagan (1995) (see also Breitung et al. (2004) for further discussion and an illustration).

In this model, we get from (9.1.10), $u_t = A^{-1}B\varepsilon_t$ and, hence, $\Sigma_u = A^{-1}BB'A'^{-1}$. Thus, we have $K(K+1)/2$ equations

$$\operatorname{vech}(\Sigma_u) = \operatorname{vech}(A^{-1}BB'A'^{-1}), \qquad (9.1.11)$$

whereas the two matrices $A$ and $B$ have $K^2$ elements each. Thus, we need additionally $2K^2 - \frac{1}{2}K(K+1)$ restrictions to identify all $2K^2$ elements of $A$ and $B$ at least locally. Even if the diagonal elements of $A$ are set to one, $2K^2 - K - \frac{1}{2}K(K+1)$ further restrictions are needed for identification. Therefore, it is perhaps not surprising that most applications consider special cases with $A = I_K$ (B-model) or $B = I_K$ (A-model). Still, the general model is a useful framework for SVAR analysis. The restrictions are typically normalization or zero restrictions which can be written in the form of linear equations,

$$\operatorname{vec}(A) = R_A \gamma_A + r_A \quad \text{and} \quad \operatorname{vec}(B) = R_B \gamma_B + r_B, \qquad (9.1.12)$$

where $R_A$ and $R_B$ are suitable fixed matrices of zeros and ones, $\gamma_A$ and $\gamma_B$ are vectors of free parameters and $r_A$ and $r_B$ are vectors of fixed parameters which make it possible, for instance, to normalize the diagonal elements of $A$. Although $r_B$ is typically zero, as in (9.1.9), we present the restrictions for $B$ here with a general $r_B$ vector because this additional term will not complicate the analysis.

Multiplying the two sets of equations in (9.1.12) by orthogonal complements of $R_A$ and $R_B$, $R_{A\perp}'$ and $R_{B\perp}'$, respectively, it is easy to see that they can be written alternatively in the form

$$C_A \operatorname{vec}(A) = c_A \quad \text{and} \quad C_B \operatorname{vec}(B) = c_B, \qquad (9.1.13)$$

where $C_A = R_{A\perp}'$, $C_B = R_{B\perp}'$, $c_A = R_{A\perp}' r_A$ and $c_B = R_{B\perp}' r_B$ (see Appendix A.8.2 for the definition of an orthogonal complement of a matrix). The matrices $C_A$ and $C_B$ may be thought of as appropriate selection matrices. Again, in general, the restrictions will ensure only local uniqueness of $A$ and $B$ due to the nonlinear nature of the full set of equations from which to solve for the two matrices. The following proposition states a rank condition for local identification.
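The passage from the explicit form (9.1.12) to the implicit form (9.1.13) can be sketched numerically. The helper `orth_complement` below is hypothetical and computes an orthonormal basis of the orthogonal complement from a singular value decomposition; the resulting $C_A$ spans the same restrictions as a 0/1 selection matrix but need not coincide with one. The bivariate example is an assumption made for illustration.

```python
import numpy as np

def orth_complement(R, tol=1e-12):
    """Orthonormal columns spanning the orthogonal complement of col(R)."""
    U, s, _ = np.linalg.svd(R, full_matrices=True)
    rank = int(np.sum(s > tol))
    return U[:, rank:]

# K = 2, vec(A) = (a11, a21, a12, a22)' with a11 = a22 = 1, a12 = 0 and a21 free
R_A = np.array([[0.0], [1.0], [0.0], [0.0]])
r_A = np.array([1.0, 0.0, 0.0, 1.0])

C_A = orth_complement(R_A).T          # rows annihilate R_A
c_A = C_A @ r_A

gamma_A = np.array([-0.7])            # any value of the free parameter
vec_A = R_A @ gamma_A + r_A           # explicit form
residual = C_A @ vec_A - c_A          # implicit form holds for every gamma_A
```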

Proposition 9.3 (Local Identification of the AB-Model)

Let $A$ and $B$ be nonsingular $(K \times K)$ matrices. Then, for a given symmetric, positive definite $(K \times K)$ matrix $\Sigma_u$, the system of equations in (9.1.11)/(9.1.13) has a locally unique solution if and only if

$$\operatorname{rk} \begin{bmatrix} -2D_K^+(\Sigma_u \otimes A^{-1}) & 2D_K^+(A^{-1}B \otimes A^{-1}) \\ C_A & 0 \\ 0 & C_B \end{bmatrix} = 2K^2. \qquad (9.1.14)$$

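Proof: Again, we can use the same reasoning as in the proof of Proposition 9.1. The result of Proposition 9.3 is then obtained by noting that

$$\begin{aligned} \frac{\partial \operatorname{vech}(A^{-1}BB'A'^{-1})}{\partial \operatorname{vec}(A)'} &= \frac{\partial \operatorname{vech}(A^{-1}BB'A'^{-1})}{\partial \operatorname{vec}(A^{-1}B)'} \frac{\partial \operatorname{vec}(A^{-1}B)}{\partial \operatorname{vec}(A)'} \\ &= -D_K^+[(A^{-1}B \otimes I_K) + (I_K \otimes A^{-1}B)K_{KK}](B'A'^{-1} \otimes A^{-1}) \\ &= -D_K^+(I_{K^2} + K_{KK})(\Sigma_u \otimes A^{-1}) \\ &= -2D_K^+(\Sigma_u \otimes A^{-1}) \end{aligned}$$

and

$$\begin{aligned} \frac{\partial \operatorname{vech}(A^{-1}BB'A'^{-1})}{\partial \operatorname{vec}(B)'} &= \frac{\partial \operatorname{vech}(A^{-1}BB'A'^{-1})}{\partial \operatorname{vec}(A^{-1}B)'} \frac{\partial \operatorname{vec}(A^{-1}B)}{\partial \operatorname{vec}(B)'} \\ &= D_K^+(I_{K^2} + K_{KK})(A^{-1}B \otimes A^{-1}) \\ &= 2D_K^+(A^{-1}B \otimes A^{-1}), \end{aligned}$$

because $D_K^+K_{KK} = D_K^+$ (see Appendix A.12.2). ■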
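To illustrate the AB-model, we follow Breitung et al. (2004) and use a small macro system from Pagan (1995) for output $q_t$, an interest rate $i_t$, and real money $m_t$. The residuals of the reduced form VAR model will be denoted by $u_t = (u_t^q, u_t^i, u_t^m)'$. Pagan (1995) uses Keynesian arguments to specify the following relations between the reduced form residuals and the structural innovations:

$$\begin{aligned} u_t^q &= -a_{12} u_t^i + b_{11} \varepsilon_t^{IS} && \text{(IS curve)}, \\ u_t^i &= -a_{21} u_t^q - a_{23} u_t^m + b_{22} \varepsilon_t^{LM} && \text{(inverse LM curve)}, \\ u_t^m &= b_{33} \varepsilon_t^m && \text{(money supply rule)}. \end{aligned}$$

Here $\varepsilon_t = (\varepsilon_t^{IS}, \varepsilon_t^{LM}, \varepsilon_t^m)'$ is the vector of structural innovations with $\varepsilon_t \sim (0, I_K)$ (see Breitung et al. (2004) for further discussion of this example system).

For our purposes the three equations can be written in AB-model form as

$$\begin{bmatrix} 1 & a_{12} & 0 \\ a_{21} & 1 & a_{23} \\ 0 & 0 & 1 \end{bmatrix} u_t = \begin{bmatrix} b_{11} & 0 & 0 \\ 0 & b_{22} & 0 \\ 0 & 0 & b_{33} \end{bmatrix} \varepsilon_t.$$

Thus, we have the following set of restrictions:

$$A = \begin{bmatrix} 1 & * & 0 \\ * & 1 & * \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} * & 0 & 0 \\ 0 & * & 0 \\ 0 & 0 & * \end{bmatrix},$$

where the asterisks denote unrestricted elements. We need $2K^2 - \frac{1}{2}K(K+1) = 12$ restrictions on $A$ and $B$ for identification in this example model. There are 3 zeros and 3 ones in $A$. Thus, we have 6 restrictions on this matrix. In addition, there are 6 zero restrictions for $B$.

Writing the restrictions in the form (9.1.13), we get

$$\begin{bmatrix} 1&0&0&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0&0 \\ 0&0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&0&0&1 \end{bmatrix} \operatorname{vec}(A) = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0&1&0&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0&0 \\ 0&0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&0&1&0 \end{bmatrix} \operatorname{vec}(B) = 0.$$

Thus, the necessary condition for local identification is satisfied. The necessary and sufficient condition from Proposition 9.3 can be checked by selecting randomly drawn matrices $A$ and $B$ from the restricted parameter space and determining the rank of the corresponding matrix in (9.1.14).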
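Such a numerical check might be sketched as follows for the example system. The helper names are hypothetical, the parameter values are arbitrary illustrative points, and instead of building $D_K^+$ explicitly the code uses the identity $D_K^+ \operatorname{vec}(M) = \operatorname{vech}((M + M')/2)$ column by column. The positions of the restricted elements in $\operatorname{vec}(A)$ and $\operatorname{vec}(B)$ follow from the three equations of the example.

```python
import numpy as np

def vech(M):
    """Lower triangle of M stacked column by column."""
    return np.concatenate([M[j:, j] for j in range(M.shape[0])])

def dplus_times(X, K):
    """D_K^+ X, computed as vech of the symmetrized reshaped columns of X."""
    cols = []
    for x in X.T:
        M = x.reshape(K, K, order="F")
        cols.append(vech((M + M.T) / 2))
    return np.column_stack(cols)

def ab_rank(A, B, C_A, C_B):
    """Rank of the matrix in (9.1.14) at the point (A, B)."""
    K = A.shape[0]
    A_inv = np.linalg.inv(A)
    Sigma_u = A_inv @ B @ B.T @ A_inv.T
    top = np.hstack([-2 * dplus_times(np.kron(Sigma_u, A_inv), K),
                     2 * dplus_times(np.kron(A_inv @ B, A_inv), K)])
    mid = np.hstack([C_A, np.zeros((C_A.shape[0], K * K))])
    bot = np.hstack([np.zeros((C_B.shape[0], K * K)), C_B])
    return np.linalg.matrix_rank(np.vstack([top, mid, bot]))

# Restricted positions in column-major vec order (0-based)
C_A = np.eye(9)[[0, 2, 4, 5, 6, 8]]   # a11, a31, a22, a32, a13, a33
C_B = np.eye(9)[[1, 2, 3, 5, 6, 7]]   # off-diagonal elements of B

def draw(rng):
    """Random point of the restricted parameter space."""
    a12, a21, a23 = rng.normal(size=3)
    A = np.array([[1.0, a12, 0.0], [a21, 1.0, a23], [0.0, 0.0, 1.0]])
    return A, np.diag(rng.uniform(0.5, 2.0, size=3))

# A fixed illustrative point and a few random draws
A0 = np.array([[1.0, 0.5, 0.0], [-0.3, 1.0, 0.8], [0.0, 0.0, 1.0]])
B0 = np.diag([1.0, 2.0, 1.5])
rank_fixed = ab_rank(A0, B0, C_A, C_B)
rng = np.random.default_rng(0)
ranks_random = [ab_rank(*draw(rng), C_A, C_B) for _ in range(5)]
```

A rank of $2K^2 = 18$ at such points indicates that the rank condition of Proposition 9.3 is satisfied there.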


9.1.4  Long-Run Restrictions à la Blanchard-Quah

Clearly,  it  is  not  always  easy  to  find  suitable  and  generally  acceptable  restric¬ 
tions  for  the  matrices  A  and  B.  Imposing  the  restrictions  directly  on  these 
matrices  is  in  fact  not  necessary  to  identify  the  structural  innovations  and 
impulse  responses.  Another  type  of  restrictions  was  discussed  by  Blanchard 
&  Quah  (1989).  They  considered  the  accumulated  effects  of  shocks  to  the  sys¬ 
tem.  In  terms  of  the  structural  impulse  responses  in  (9.1.5)  they  focussed  on 
the  total  impact  matrix, 

OO 

S’*,  =  ^  Gi  =  (1K  ~  Ax - 24p)-1A-1B,  (9.1.15) 

i= 0 

and they identified the structural innovations by placing zero restrictions on
this matrix. In other words, they assumed that some shocks do not have any
total long-run effects. In particular, they considered a bivariate system
consisting of output growth $q_t$ and an unemployment rate $ur_t$ (i.e.,
$y_t = (q_t, ur_t)'$) and they assumed that the structural innovations
represent supply and demand shocks. Moreover, they assumed that the demand
shocks have only transitory effects on $q_t$ and that the accumulated long-run
effect of such shocks on $q_t$ is zero. Placing the supply shocks first and
the demand shocks last in the vector of structural innovations
$\varepsilon_t = (\varepsilon_t^{s}, \varepsilon_t^{d})'$, the (1,2)-element
of $\Xi_\infty$ is restricted to be zero. In other words, we restrict the
upper right-hand corner element of

$$
\Xi_\infty = (I_K - A_1 - \cdots - A_p)^{-1} A^{-1} B
$$

to zero. Given the VAR parameters, this set of equations clearly specifies a
restriction for $A^{-1}B$. Thereby we have enough restrictions for
identification of a bivariate system if we set $A = I_K$, because, for $K = 2$,
we have $K(K-1)/2 = 1$. Notice that $A = I_K$ may be chosen because the idea is
to identify the structural shocks from the reduced form residuals only and no
restrictions are placed on the instantaneous effects of the observable
variables directly. Thus, we have a B-model with restriction

$$
(0, 0, 1, 0)\,\mathrm{vec}\big[(I_K - A_1 - \cdots - A_p)^{-1} B\big]
  = (0, 0, 1, 0)\big[I_2 \otimes (I_K - A_1 - \cdots - A_p)^{-1}\big]\mathrm{vec}(B) = 0.
$$
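The equivalence of the two forms of this restriction can be checked numerically. The following is a minimal sketch, assuming a hypothetical VAR(1) coefficient matrix `A1` and an arbitrary candidate matrix `B` (both invented for illustration and not taken from Blanchard & Quah); it only verifies the identity vec(CB) = (I ⊗ C)vec(B) for the selected element.

```python
import numpy as np

# Hypothetical VAR(1) coefficient matrix of a stable bivariate system
# (illustrative numbers only).
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
K = A1.shape[0]

# (I_K - A_1)^{-1} maps the impact matrix B into accumulated long-run effects.
long_run_mult = np.linalg.inv(np.eye(K) - A1)

# Arbitrary candidate impact matrix (illustrative).
B = np.array([[1.0, -0.4],
              [0.3, 0.8]])

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

# The third element of vec of a 2x2 matrix is its (1,2)-element.
select = np.array([0.0, 0.0, 1.0, 0.0])

lhs = select @ vec(long_run_mult @ B)
rhs = select @ np.kron(np.eye(K), long_run_mult) @ vec(B)
upper_right = (long_run_mult @ B)[0, 1]
```

Both `lhs` and `rhs` equal the upper right-hand element of the long-run matrix, so the restriction is linear in vec(B) once the VAR coefficients are given.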

In summary, the AB-model offers a useful general framework for placing
identifying restrictions for the structural innovations and impulse responses
on a VAR process. The restrictions can be simple normalization and exclusion
(zero) restrictions and may also be more general nonlinear restrictions.
Clearly, before we can actually use this framework in practice, it will be
necessary to estimate the reduced form and structural parameters. Estimation
of the former parameters has been discussed in some detail in previous
chapters. Thus, it remains to consider estimation of the A, B matrices. We
will do so in Section 9.3. Before turning to inference procedures, we will
consider structural restrictions for VECMs in the following section.


9.2  Structural  Vector  Error  Correction  Models 

If all or some of the variables of interest are integrated, the previously
discussed AB-model can still be used together with the levels VAR form of the
data generation process. In most of the analysis of Section 9.1, the
stationarity of the process was not used. Only in the treatment of the
Blanchard-Quah restrictions is stability of the VAR operator required, because
otherwise the matrix of total accumulated long-run effects does not exist.
This result follows from the fact that the matrix
$(I_K - A_1 - \cdots - A_p)$ is singular for cointegrated processes, as we have
seen in Chapter 6. In other cases, we may use the AB-model even for integrated
variables. In fact, we can even specify and fit a reduced form VECM, convert
that model to the levels VAR form and then use it as a basis for an
AB-analysis, as discussed in the previous section. There are, however,
advantages in utilizing the cointegration properties of the variables. They
provide restrictions which can be taken into account beneficially in
identifying the structural shocks. Therefore, it is useful to treat SVECMs
separately.

As in the previous chapters, we assume that all variables are at most I(1)
and that the data generation process can be represented as a VECM with
cointegration rank r of the form

$$
\Delta y_t = \alpha\beta' y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots
             + \Gamma_{p-1} \Delta y_{t-p+1} + u_t, \tag{9.2.1}
$$

where all symbols have their usual meanings. In other words, $y_t$ is a
K-dimensional vector of observable variables, $\alpha$ is a $(K \times r)$
matrix of loading coefficients, $\beta$ is the $(K \times r)$ cointegration
matrix, $\Gamma_j$ is a $(K \times K)$ short-run coefficient matrix for
$j = 1, \dots, p-1$, and $u_t$ is a white noise error vector with
$u_t \sim (0, \Sigma_u)$.

In Chapter 6, Proposition 6.1, we have seen that the process has the
Beveridge-Nelson MA representation

$$
y_t = \Xi \sum_{i=1}^{t} u_i + \sum_{j=0}^{\infty} \Xi_j^{*} u_{t-j} + y_0^{*}, \tag{9.2.2}
$$

where the $\Xi_j^{*}$ are absolutely summable so that the infinite sum is
well-defined and the term $y_0^{*}$ contains the initial values. Absolute
summability of the $\Xi_j^{*}$ implies that these matrices converge to zero for
$j \to \infty$. Thus, the long-run effects of shocks are captured by the common
trends term $\Xi \sum_{i=1}^{t} u_i$. The matrix

$$
\Xi = \beta_\perp \Big[ \alpha_\perp' \Big( I_K - \sum_{i=1}^{p-1} \Gamma_i \Big) \beta_\perp \Big]^{-1} \alpha_\perp'
$$

has rank $K - r$. Thus, there are $K - r$ common trends and if the structural
innovations embodied in the $u_t$ can be recovered, at most r of them can have
transitory effects only because the matrix $\Xi$ or a nonsingular
transformation of this matrix cannot have more than r columns of zeros. Thus,
by knowing the cointegrating rank of the system, we know already the maximum
number of transitory shocks.

In this context, the focus of interest is usually on the residuals and, hence,
in order to identify the structural innovations, the B-model setup is typically
used. In other words, we are looking for a matrix B such that

$$
u_t = B\varepsilon_t \quad \text{with} \quad \varepsilon_t \sim (0, I_K).
$$

Substituting this relation in the common trends term gives
$\Xi B \sum_{i=1}^{t} \varepsilon_i$. Hence, the long-run effects of the
structural innovations are given by $\Xi B$.

Because the structural innovations represent a regular random vector with
nonsingular covariance matrix, the matrix B has to be nonsingular. Recall
that $\Sigma_u = BB'$. Thus, $\mathrm{rk}(\Xi B) = K - r$ and there can be at
most r zero columns in this matrix. In other words, r of the structural
innovations can have transitory effects and $K - r$ of them must have
permanent effects. If there are r transitory shocks, we can restrict r columns
of $\Xi B$ to zero. Because the matrix has reduced rank $K - r$, each column
of zeros stands for $K - r$ independent restrictions only. Thus, the r
transitory shocks represent $r(K - r)$ independent restrictions only. Still,
it is useful to note that restrictions can be imposed on the basis of our
knowledge of the cointegrating rank of the system which can be determined by
statistical means. Further theoretical considerations are required for
imposing additional restrictions, however.
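The rank property of the long-run matrix can be illustrated with a short computation. The sketch below assumes hypothetical values for the loading matrix, the cointegration matrix and one short-run coefficient matrix (K = 3, r = 2); the helper names `orth_complement` and `long_run_matrix` are introduced here for illustration only. Orthogonal complements are taken from a singular value decomposition, which is one of several valid choices.

```python
import numpy as np

def orth_complement(M):
    """Orthonormal basis of the orthogonal complement of the column space of M.

    Assumes M is (K x r) with full column rank r < K.
    """
    r = M.shape[1]
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    return U[:, r:]

def long_run_matrix(alpha, beta, gammas):
    """Xi = beta_perp [alpha_perp' (I_K - sum_i Gamma_i) beta_perp]^{-1} alpha_perp'."""
    K = alpha.shape[0]
    a_perp = orth_complement(alpha)
    b_perp = orth_complement(beta)
    gamma_sum = sum(gammas, np.zeros((K, K)))
    inner = a_perp.T @ (np.eye(K) - gamma_sum) @ b_perp
    return b_perp @ np.linalg.inv(inner) @ a_perp.T

# Hypothetical parameter values with K = 3 and cointegrating rank r = 2.
alpha = np.array([[-0.2, 0.2],
                  [-0.1, 0.1],
                  [-0.1, 0.3]])
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, -1.1]])
gamma1 = 0.1 * np.eye(3)

Xi = long_run_matrix(alpha, beta, [gamma1])
rank_Xi = np.linalg.matrix_rank(Xi)   # K - r
```

Because the columns of the loading matrix lie in the null space of the long-run matrix, any shock whose impact vector is a linear combination of these columns has transitory effects only.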

For local just-identification of the structural innovations in the B-model,
we need a total of $K(K-1)/2$ restrictions. Assuming that there are r shocks
with transitory effects only, we have already $r(K-r)$ restrictions from the
cointegration structure of the model. This leaves us with
$\tfrac{1}{2}K(K-1) - r(K-r)$ further restrictions for just-identifying the
structural innovations. In fact, $r(r-1)/2$ additional contemporaneous
restrictions are needed to disentangle the transitory shocks and
$(K-r)((K-r)-1)/2$ restrictions identify the permanent shocks (see, e.g.,
King et al. (1991), Gonzalo & Ng (2001)). Then we have a total of
$\tfrac{1}{2}r(r-1) + \tfrac{1}{2}(K-r)((K-r)-1) = \tfrac{1}{2}K(K-1) - r(K-r)$
restrictions, as required. Thus, it is not sufficient to impose arbitrary
restrictions on B or $\Xi B$, but we have to choose them to identify the
transitory and permanent shocks at least locally. In fact, the transitory
shocks can only be identified through restrictions directly on B because they
correspond to zero columns in $\Xi B$. Thus, $r(r-1)/2$ of the restrictions
have to be imposed on B directly. Generally, the restrictions have the form

$$
C_{\Xi B}\,\mathrm{vec}(\Xi B) = c_l \ \text{ or } \ C_l\,\mathrm{vec}(B) = c_l
\quad \text{and} \quad C_s\,\mathrm{vec}(B) = c_s, \tag{9.2.3}
$$

where $C_l := C_{\Xi B}(I_K \otimes \Xi)$ is a matrix of long-run restrictions,
that is, $C_{\Xi B}$ is a suitable selection matrix such that
$C_{\Xi B}\,\mathrm{vec}(\Xi B) = c_l$, and $C_s$ specifies short-run or
instantaneous constraints by restricting elements of B directly. Here $c_l$
and $c_s$ are vectors of suitable dimensions. In applied work, they are
typically zero vectors. In other words, zero restrictions are specified in
(9.2.3) for $\Xi B$ and B.
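The counting argument above is easy to tabulate. The following small helper is a sketch for illustration; the function name and the dictionary keys are chosen here and do not come from the text.

```python
def svecm_restriction_counts(K, r):
    """Restriction counts for a B-model SVECM with r transitory shocks."""
    total = K * (K - 1) // 2                  # needed for local just-identification
    from_cointegration = r * (K - r)          # implied by r zero columns of Xi B
    transitory = r * (r - 1) // 2             # contemporaneous, placed on B directly
    permanent = (K - r) * (K - r - 1) // 2    # to identify the permanent shocks
    return {
        "total": total,
        "from_cointegration": from_cointegration,
        "transitory": transitory,
        "permanent": permanent,
    }

counts_3_2 = svecm_restriction_counts(K=3, r=2)
counts_2_1 = svecm_restriction_counts(K=2, r=1)
```

For K = 3 and r = 2, one further restriction on B is needed beyond the two implied by the zero columns of the long-run matrix; for K = 2 and r = 1, none is needed.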

As  discussed  for  the  stationary  case  in  Section  9.1.2,  the  matrix  B  will 
only  be  locally  identified.  In  particular,  in  general  we  may  reverse  the  signs  of 
the  columns  of  B  to  find  another  valid  matrix.  Formal  necessary  and  sufficient 
conditions  for  local  identification  are  given  in  the  following  proposition. 

Proposition 9.4 (Local Identification of a SVECM)

Suppose the reduced form model (9.2.1) with Beveridge-Nelson MA
representation (9.2.2) is given. Let B be a nonsingular $(K \times K)$ matrix.
Then, the set of equations

$$
\Sigma_u = BB', \quad C_l\,\mathrm{vec}(B) = c_l \quad \text{and} \quad C_s\,\mathrm{vec}(B) = c_s,
$$

with $C_l$, $c_l$, $C_s$, and $c_s$ as in (9.2.3), has a locally unique
solution for B if and only if

$$
\mathrm{rk}\begin{bmatrix} 2D_K^{+}(B \otimes I_K) \\ C_l \\ C_s \end{bmatrix} = K^2.
$$

Proof: The model underlying Proposition 9.4 is a B-model. Therefore the
proposition can be shown using the same arguments as for Proposition 9.2.
Details are omitted. ■

As an example, we consider a small model discussed by King et al. (1991).
They specified a model for the logarithms of private output ($q_t$),
consumption ($c_t$), and investment ($i_t$). Assuming that all three variables
are I(1) with cointegrating rank $r = 2$ and that there are two transitory
shocks and one permanent shock, the permanent shock is identified without
further assumptions because $K - r = 1$ and, hence, $(K-r)((K-r)-1)/2 = 0$.
Moreover, only 1 ($= r(r-1)/2$) further restriction is necessary to identify
the two transitory shocks. Placing the permanent shock first in the
$\varepsilon_t$ vector and allowing the first transitory shock to have
instantaneous effects on all variables, we may use the following restrictions:

$$
\Xi B = \begin{bmatrix} * & 0 & 0 \\ * & 0 & 0 \\ * & 0 & 0 \end{bmatrix}
\quad \text{and} \quad
B = \begin{bmatrix} * & * & * \\ * & * & 0 \\ * & * & * \end{bmatrix}. \tag{9.2.4}
$$

Here asterisks denote unrestricted elements. The two zero columns in $\Xi B$
represent two independent restrictions only because $\Xi B$ has rank 1. A
third restriction is placed on B in such a way that the third shock does not
have an instantaneous effect on the second variable. Hence, there are
$K(K-1)/2 = 3$ independent restrictions in total and the structural
innovations are locally just-identified. Uniqueness can be obtained by fixing
the signs of the diagonal elements of B.

In our three-dimensional example with two zero columns in $\Xi B$, it does not
suffice to impose a further restriction on this matrix to ensure local
uniqueness of B. For that we need to disentangle the two transitory shocks
which cannot be identified by restrictions on the long-run matrix $\Xi B$.
Thus, we have to impose a restriction directly on B. In fact, it is necessary
to restrict an element in the last two columns of B (see also Problem 9.1 for
further details).
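The rank condition of Proposition 9.4 can be evaluated numerically for the restriction pattern of this example. The sketch below uses an invented rank-one long-run matrix `Xi` and an invented matrix `B` that satisfies the zero restrictions; the helper names are hypothetical. It compares a restriction on the last column of B with a restriction on the first column, which leaves the two transitory shocks undetermined.

```python
import numpy as np

def duplication_matrix(K):
    """D_K such that vec(S) = D_K vech(S) for symmetric S (column-major vec)."""
    D = np.zeros((K * K, K * (K + 1) // 2))
    col = 0
    for j in range(K):
        for i in range(j, K):
            D[j * K + i, col] = 1.0
            D[i * K + j, col] = 1.0
            col += 1
    return D

def identification_rank(B, Xi, zero_cols_longrun, zero_cells_B):
    """Rank of [2 D_K^+ (B kron I_K); C_l; C_s] for zero restrictions.

    zero_cols_longrun: 0-based indices of the zero columns of Xi B.
    zero_cells_B: list of 0-based (row, column) cells of B set to zero.
    """
    K = B.shape[0]
    I_K = np.eye(K)
    D_plus = np.linalg.pinv(duplication_matrix(K))
    # Selection matrix picking all elements of the restricted columns of Xi B.
    rows = [j * K + i for j in zero_cols_longrun for i in range(K)]
    C_l = np.eye(K * K)[rows] @ np.kron(I_K, Xi)
    # Selection matrix picking the restricted elements of vec(B).
    C_s = np.eye(K * K)[[j * K + i for (i, j) in zero_cells_B]]
    stacked = np.vstack([2.0 * D_plus @ np.kron(B, I_K), C_l, C_s])
    return np.linalg.matrix_rank(stacked)

# Hypothetical rank-one long-run matrix (K = 3, r = 2).
Xi = np.outer([1.0, 1.1, 1.0], [1.0, 1.0, 1.0])

# Columns 2 and 3 of B lie in the null space of Xi and the (2,3)-element is zero.
B = np.array([[1.0, 1.0, 1.0],
              [0.5, 1.0, 0.0],
              [0.2, -2.0, -1.0]])
rank_last_column = identification_rank(B, Xi, [1, 2], [(1, 2)])

# Restricting an element of the first column instead does not help.
B_alt = B.copy()
B_alt[0, 0] = 0.0
rank_first_column = identification_rank(B_alt, Xi, [1, 2], [(0, 0)])
```

With the restriction in the last column the stacked matrix has rank K² = 9, whereas the restriction in the first column yields rank 8, because a rotation of the two transitory shocks remains possible.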

In the standard B-model with three variables, we need to specify at least
3 restrictions for identification. In contrast, in the present VECM case,
assuming that $r = 2$ and there are two transitory shocks, only one
restriction is needed because two columns of $\Xi B$ are zero. Thus, taking
into account the long-run restrictions from the cointegration properties of
the variables may result in substantial simplifications. In fact, for a
bivariate system with one cointegrating relation, no further restriction is
required to identify the permanent and transitory shocks. It is enough to
specify that the first shock is allowed to have permanent effects while the
second one can only have transitory effects or vice versa. A more detailed
higher-dimensional example may be found in Breitung et al. (2004). Further
discussion of partitioning the shocks into permanent and transitory ones is
also given in Gonzalo & Ng (2001) and Fisher & Huh (1999).


9.3  Estimation  of  Structural  Parameters 

We  will  first  consider  estimation  of  the  AB-SVAR  model  and  then  discuss 
SVECMs.  The  A-  and  B-models  are  straightforward  special  cases  which  are 
not  treated  separately  in  detail.  For  both  SVARs  and  SVECMs,  ML  methods 
are  typically  used  and  they  will  therefore  be  presented  here. 


9.3.1  Estimating  SVAR  Models 


Suppose we wish to estimate the following SVAR model

$$
A y_t = A \mathbf{A} Y_{t-1} + B \varepsilon_t, \tag{9.3.1}
$$

where $Y_{t-1}' := [y_{t-1}', \dots, y_{t-p}']$,
$\mathbf{A} := [A_1, \dots, A_p]$ collects the reduced form coefficient
matrices (not to be confused with the structural matrix A), and
$\varepsilon_t$ is assumed to be Gaussian white noise with covariance matrix
$I_K$, $\varepsilon_t \sim \mathcal{N}(0, I_K)$. The normality assumption is
just made for convenience to derive the estimators. The asymptotic properties
of the estimators will be the same under more general distributional
assumptions, as usual. The reduced form residuals corresponding to (9.3.1)
have the form $u_t = A^{-1}B\varepsilon_t$.

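From Chapter 3, Section 3.4, the log-likelihood function for a sample
$y_1, \dots, y_T$ is seen to be

$$
\begin{aligned}
\ln l(\mathbf{A}, A, B)
 &= -\frac{KT}{2}\ln 2\pi - \frac{T}{2}\ln\left|A^{-1}BB'A'^{-1}\right|
    - \frac{1}{2}\,\mathrm{tr}\left\{(Y - \mathbf{A}X)'\left[A^{-1}BB'A'^{-1}\right]^{-1}(Y - \mathbf{A}X)\right\} \\
 &= \text{constant} + \frac{T}{2}\ln|A|^2 - \frac{T}{2}\ln|B|^2
    - \frac{1}{2}\,\mathrm{tr}\left\{A'B'^{-1}B^{-1}A\,(Y - \mathbf{A}X)(Y - \mathbf{A}X)'\right\},
\end{aligned} \tag{9.3.2}
$$

where, as usual, $Y := [y_1, \dots, y_T]$, $X := [Y_0, \dots, Y_{T-1}]$, and
the matrix rules $|A^{-1}BB'(A^{-1})'| = |A^{-1}|^2|B|^2 = |A|^{-2}|B|^2$ and
$\mathrm{tr}(VW) = \mathrm{tr}(WV)$ have been used (see Appendix A).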

Suppose there are no restrictions on the reduced form parameters
$\mathbf{A} = [A_1, \dots, A_p]$. Then, it follows from Section 3.4 that for
any given A and B, the log-likelihood function $\ln l(\mathbf{A}, A, B)$ is
maximized with respect to $\mathbf{A}$ by $\hat{\mathbf{A}} = YX'(XX')^{-1}$.
Thus, replacing $\mathbf{A}$ with $\hat{\mathbf{A}}$ in (9.3.2) gives the
concentrated log-likelihood

$$
\ln l_c(A, B) = \text{constant} + \frac{T}{2}\ln|A|^2 - \frac{T}{2}\ln|B|^2
                - \frac{T}{2}\,\mathrm{tr}\left(A'B'^{-1}B^{-1}A\,\tilde\Sigma_u\right), \tag{9.3.3}
$$

where $\tilde\Sigma_u = T^{-1}(Y - \hat{\mathbf{A}}X)(Y - \hat{\mathbf{A}}X)'$.
Maximization of this function with respect to A and B, subject to the
structural restrictions (9.1.12) or (9.1.13), has to be done by numerical
methods because a closed form solution is usually not available. If the
restrictions are of the form (9.1.12), restricted maximization of the
concentrated log-likelihood amounts to maximization with respect to
$\gamma_A$ and $\gamma_B$. If these parameters are locally identified, the ML
estimators have standard asymptotic properties which are summarized in the
following proposition.
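The concentrated log-likelihood (9.3.3) is straightforward to evaluate. The following sketch omits the additive constant and uses an invented residual covariance matrix; the function name is hypothetical. It also illustrates that a just-identified choice (here A equal to the identity and B a Choleski factor) attains the reduced form maximum.

```python
import numpy as np

def concentrated_loglik(A, B, Sigma_u, T):
    """ln l_c(A, B) of (9.3.3) without the additive constant."""
    C = np.linalg.solve(B, A)                  # B^{-1} A
    _, logdet_A = np.linalg.slogdet(A)         # ln|det A|, so (T/2) ln|A|^2 = T ln|det A|
    _, logdet_B = np.linalg.slogdet(B)
    return T * logdet_A - T * logdet_B - 0.5 * T * np.trace(C.T @ C @ Sigma_u)

# Hypothetical reduced form residual covariance estimate and sample size.
Sigma_u = np.array([[1.0, 0.5],
                    [0.5, 2.0]])
T = 100
K = Sigma_u.shape[0]

B_chol = np.linalg.cholesky(Sigma_u)           # just-identified B with A = I_K
value_just_identified = concentrated_loglik(np.eye(K), B_chol, Sigma_u, T)

# Reduced form maximum (same constant omitted): -(T/2) ln|Sigma_u| - TK/2.
value_reduced_form = -0.5 * T * np.linalg.slogdet(Sigma_u)[1] - 0.5 * T * K

# Any B with BB' different from Sigma_u gives a smaller value.
value_perturbed = concentrated_loglik(np.eye(K), 1.1 * B_chol, Sigma_u, T)
```

In a numerical optimizer, the negative of this function would be minimized over the free elements of A and B.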

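Proposition 9.5 (Properties of the SVAR ML Estimators)

Suppose $y_t$ is a stationary Gaussian VAR(p) as in (9.1.1) and structural
restrictions of the form (9.1.12) are available such that $\gamma_A$ and
$\gamma_B$ are locally identified. Then the ML estimators $\tilde\gamma_A$ and
$\tilde\gamma_B$ are consistent and asymptotically normally distributed,

$$
\sqrt{T}\left(\begin{bmatrix}\tilde\gamma_A \\ \tilde\gamma_B\end{bmatrix}
 - \begin{bmatrix}\gamma_A \\ \gamma_B\end{bmatrix}\right)
 \xrightarrow{d} \mathcal{N}\left(0,\ \mathcal{I}_a\begin{pmatrix}\gamma_A \\ \gamma_B\end{pmatrix}^{-1}\right),
$$

where $\mathcal{I}_a(\cdot)$ is the asymptotic information matrix. It has the
form

$$
\mathcal{I}_a\begin{pmatrix}\gamma_A \\ \gamma_B\end{pmatrix}
 = \begin{bmatrix} R_A' & 0 \\ 0 & R_B' \end{bmatrix}
   \mathcal{I}_a\begin{pmatrix}\mathrm{vec}\,A \\ \mathrm{vec}\,B\end{pmatrix}
   \begin{bmatrix} R_A & 0 \\ 0 & R_B \end{bmatrix}
$$

and

$$
\mathcal{I}_a\begin{pmatrix}\mathrm{vec}\,A \\ \mathrm{vec}\,B\end{pmatrix}
 = \begin{bmatrix} A^{-1}B \otimes B'^{-1} \\ -(I_K \otimes B'^{-1}) \end{bmatrix}
   (I_{K^2} + K_{KK})
   \big[(B'A'^{-1} \otimes B^{-1}) : -(I_K \otimes B^{-1})\big]. \tag{9.3.4}
$$

Proof: The proposition follows from the general ML theory (see Appendix
C.6). For the derivation of the asymptotic information matrix see Problem 9.4.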


If $\gamma_A$ and $\gamma_B$ are identified, the same is true for A and B.
Estimating these matrices such that
$\mathrm{vec}(\tilde A) = R_A\tilde\gamma_A + r_A$ and
$\mathrm{vec}(\tilde B) = R_B\tilde\gamma_B + r_B$, respectively, we get the
following immediate implication of Proposition 9.5.

Corollary 9.5.1

Under the conditions of Proposition 9.5,

$$
\sqrt{T}\left(\begin{bmatrix}\mathrm{vec}\,\tilde A \\ \mathrm{vec}\,\tilde B\end{bmatrix}
 - \begin{bmatrix}\mathrm{vec}\,A \\ \mathrm{vec}\,B\end{bmatrix}\right)
 \xrightarrow{d} \mathcal{N}(0, \Sigma_{AB}),
$$

where

$$
\Sigma_{AB} = \begin{bmatrix} R_A & 0 \\ 0 & R_B \end{bmatrix}
  \mathcal{I}_a\begin{pmatrix}\gamma_A \\ \gamma_B\end{pmatrix}^{-1}
  \begin{bmatrix} R_A' & 0 \\ 0 & R_B' \end{bmatrix}.
$$

If only just-identifying restrictions are imposed on the structural
parameters, we have for the ML estimator of $\Sigma_u$,

$$
\tilde\Sigma_u = T^{-1}(Y - \hat{\mathbf{A}}X)(Y - \hat{\mathbf{A}}X)'
               = \tilde A^{-1}\tilde B\tilde B'\tilde A'^{-1}.
$$

If, however, over-identifying restrictions have been imposed on A and/or B,
the corresponding estimator for $\Sigma_u$,

$$
\tilde\Sigma_u^{r} := \tilde A^{-1}\tilde B\tilde B'\tilde A'^{-1}, \tag{9.3.5}
$$

will differ from $\tilde\Sigma_u$. In fact, the LR statistic,

$$
\lambda_{LR} = T\left(\ln|\tilde\Sigma_u^{r}| - \ln|\tilde\Sigma_u|\right), \tag{9.3.6}
$$

can be used to check the over-identifying restrictions. Under the null
hypothesis that the restrictions are valid, it has an asymptotic
$\chi^2$-distribution with degrees of freedom equal to the number of
over-identifying restrictions. In other words, the number of degrees of
freedom is equal to the number of independent constraints imposed on A and B
minus $2K^2 - \tfrac{1}{2}K(K+1)$.
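The LR statistic (9.3.6) and its degrees of freedom can be computed as follows. This is a sketch with invented numbers; the function name is hypothetical. The example assumes a bivariate model with A fixed at the identity (four restrictions) and B restricted to be diagonal (two restrictions), in which case the restricted covariance estimate is simply the diagonal part of the unrestricted one.

```python
import numpy as np

def lr_overidentification(Sigma_r, Sigma_u, T, n_restrictions):
    """LR statistic (9.3.6) and the number of over-identifying restrictions."""
    K = Sigma_u.shape[0]
    _, logdet_r = np.linalg.slogdet(Sigma_r)
    _, logdet_u = np.linalg.slogdet(Sigma_u)
    stat = T * (logdet_r - logdet_u)
    # Restrictions needed for just-identification in the AB-model.
    just_identifying = 2 * K**2 - K * (K + 1) // 2
    return stat, n_restrictions - just_identifying

# Hypothetical unrestricted residual covariance estimate and sample size.
Sigma_u = np.array([[1.0, 0.5],
                    [0.5, 2.0]])
T = 100

# Restricted estimate under A = I_2 and diagonal B.
Sigma_r = np.diag(np.diag(Sigma_u))

stat, df = lr_overidentification(Sigma_r, Sigma_u, T, n_restrictions=6)
```

The statistic would then be compared with a critical value of the chi-squared distribution with `df` degrees of freedom.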


Computation  of  ML  Estimates 


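Because the structural parameters A and B are nonlinearly related to the
reduced form parameters, no closed form of the ML estimates exists in general
and an iterative optimization algorithm may be used for actually computing
the ML estimates. Amisano & Giannini (1997) proposed to use a scoring
algorithm for this purpose. The i-th iteration of this algorithm is of the
form

$$
\begin{bmatrix}\tilde\gamma_A \\ \tilde\gamma_B\end{bmatrix}_{i+1}
 = \begin{bmatrix}\tilde\gamma_A \\ \tilde\gamma_B\end{bmatrix}_{i}
 + \ell\,\mathcal{I}\left(\begin{bmatrix}\tilde\gamma_A \\ \tilde\gamma_B\end{bmatrix}_{i}\right)^{-1}
   s\left(\begin{bmatrix}\tilde\gamma_A \\ \tilde\gamma_B\end{bmatrix}_{i}\right), \tag{9.3.7}
$$

where $\mathcal{I}(\cdot)$ denotes the information matrix of the free
parameters $\gamma_A$, $\gamma_B$, that is, in this case
$\mathcal{I}(\cdot) = T\mathcal{I}_a(\cdot)$, $s(\cdot)$ is the score vector
and $\ell$ is the step length (see also Chapter 12, Section 12.3.2, for
further discussion of optimization algorithms of this type).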
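A scoring iteration of this kind can be sketched for the special case of a B-model (A fixed at the identity) with zero restrictions on B. The code below is a minimal illustration and not the implementation of Amisano & Giannini; the function names, the starting value and the covariance matrix are assumptions made here. The score and information matrix are those of the vec(B) block, written out for A equal to the identity, and the sample size cancels in the update.

```python
import numpy as np

def vec(M):
    """Stack the columns of M (column-major order)."""
    return M.reshape(-1, order="F")

def commutation_matrix(K):
    """K_KK such that K_KK vec(M) = vec(M') for a (K x K) matrix M."""
    Kmat = np.zeros((K * K, K * K))
    for i in range(K):
        for j in range(K):
            Kmat[i * K + j, j * K + i] = 1.0
    return Kmat

def scoring_b_model(Sigma_u, free_mask, B0, step=1.0, max_iter=100, tol=1e-12):
    """Scoring iterations for B with A = I_K; free_mask marks unrestricted cells."""
    K = Sigma_u.shape[0]
    I_K = np.eye(K)
    R = np.eye(K * K)[:, vec(free_mask)]            # vec(B) = R gamma
    N = np.eye(K * K) + commutation_matrix(K)
    gamma = vec(B0)[vec(free_mask)]
    for _ in range(max_iter):
        B = (R @ gamma).reshape(K, K, order="F")
        B_inv = np.linalg.inv(B)
        # Score with respect to vec(B^{-1}), divided by T.
        s_C = vec(B.T) - np.kron(Sigma_u, I_K) @ vec(B_inv)
        # Chain rule: score with respect to vec(B), divided by T.
        s_B = -np.kron(B_inv, B_inv.T) @ s_C
        # Asymptotic information of vec(B) when A = I_K.
        W = np.kron(I_K, B_inv)
        info_B = W.T @ N @ W
        delta = np.linalg.solve(R.T @ info_B @ R, R.T @ s_B)
        gamma = gamma + step * delta
        if np.max(np.abs(delta)) < tol:
            break
    return (R @ gamma).reshape(K, K, order="F")

# Hypothetical covariance estimate; B restricted to be lower triangular.
Sigma_u = np.array([[1.0, 0.5],
                    [0.5, 2.0]])
free_mask = np.tril(np.ones((2, 2), dtype=bool))
B_hat = scoring_b_model(Sigma_u, free_mask, B0=np.eye(2))
```

In this just-identified example the iterations started at the identity matrix converge to the Choleski factor of the covariance matrix. With other restrictions or poor starting values, a smaller step length may be needed.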
The score vector can be obtained using the rules for matrix and vector
differentiation (Appendix A.13). Applying the chain rule for vector
differentiation, it is seen to be

$$
s\begin{pmatrix}\gamma_A \\ \gamma_B\end{pmatrix}
 = \frac{\partial \ln l}{\partial(\gamma_A', \gamma_B')'}
 = \begin{bmatrix} R_A' & 0 \\ 0 & R_B' \end{bmatrix}
   s\begin{pmatrix}\mathrm{vec}\,A \\ \mathrm{vec}\,B\end{pmatrix} \tag{9.3.8}
$$

and

$$
s\begin{pmatrix}\mathrm{vec}\,A \\ \mathrm{vec}\,B\end{pmatrix}
 = \frac{\partial \ln l}{\partial(\mathrm{vec}(A)', \mathrm{vec}(B)')'}
 = \begin{bmatrix} (I_K \otimes B'^{-1}) \\ -(B^{-1}A \otimes B'^{-1}) \end{bmatrix}
   s(\mathrm{vec}[B^{-1}A])
$$

with

$$
s(\mathrm{vec}[B^{-1}A]) = T\,\mathrm{vec}\big([B^{-1}A]'^{-1}\big)
  - T(\tilde\Sigma_u \otimes I_K)\,\mathrm{vec}(B^{-1}A)
$$

(see Problem 9.3 for further details). In practice, the iterations of the
scoring algorithm terminate if prespecified convergence criteria, such as the
relative change in the log-likelihood and the parameters, are satisfied. For
this algorithm to work, the inverse of the information matrix has to exist
which is guaranteed by the identification of the parameters, at least in a
neighborhood of the true parameter values. Giannini (1992) used this property
to derive alternative conditions for identification of the models presented
in Section 9.1. More precisely, he derived identification conditions from the
fact that, for instance, the AB-model is locally identified if and only if the
matrix


(9.3.9) 


has full column rank when $\mathcal{I}_a(\cdot)$ is evaluated at the true
parameter values (see Rothenberg (1971)).
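The link between identification and invertibility of the information matrix can be checked numerically. The sketch below builds the asymptotic information matrix (9.3.4) and the information matrix of the free parameters for invented values of A and B; all function names are hypothetical. It examines nonsingularity of the information matrix of the free parameters, which is the property stated in the text; it is not meant as a transcription of Giannini's condition.

```python
import numpy as np

def commutation_matrix(K):
    """K_KK such that K_KK vec(M) = vec(M') for a (K x K) matrix M."""
    Kmat = np.zeros((K * K, K * K))
    for i in range(K):
        for j in range(K):
            Kmat[i * K + j, j * K + i] = 1.0
    return Kmat

def info_matrix_AB(A, B):
    """Asymptotic information matrix of (vec A, vec B) as in (9.3.4)."""
    K = A.shape[0]
    I_K = np.eye(K)
    A_inv = np.linalg.inv(A)
    B_inv = np.linalg.inv(B)
    left = np.vstack([np.kron(A_inv @ B, B_inv.T), -np.kron(I_K, B_inv.T)])
    N = np.eye(K * K) + commutation_matrix(K)
    return left @ N @ left.T

def free_parameter_info(A, B, R_A, R_B):
    """Information matrix of the free parameters (gamma_A, gamma_B)."""
    K2 = A.size
    n_A, n_B = R_A.shape[1], R_B.shape[1]
    R = np.zeros((2 * K2, n_A + n_B))
    R[:K2, :n_A] = R_A
    R[K2:, n_A:] = R_B
    return R.T @ info_matrix_AB(A, B) @ R

# Hypothetical B-model: A = I_2 has no free parameters.
A = np.eye(2)
B = np.array([[1.0, 0.0],
              [0.5, 1.3]])
R_A = np.zeros((4, 0))

# Lower-triangular B: free elements are positions 0, 1 and 3 of vec(B).
R_B_triangular = np.eye(4)[:, [0, 1, 3]]
rank_triangular = np.linalg.matrix_rank(free_parameter_info(A, B, R_A, R_B_triangular))

# Unrestricted B: all four elements of vec(B) are free.
R_B_unrestricted = np.eye(4)
rank_unrestricted = np.linalg.matrix_rank(free_parameter_info(A, B, R_A, R_B_unrestricted))
```

With a lower-triangular B the information matrix of the three free parameters has full rank, whereas with an unrestricted B the four-dimensional information matrix has rank three only and cannot be inverted.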

Although we have discussed models without deterministic terms and
restrictions on the reduced form parameters, the ML estimation procedure for
the structural parameters can be extended easily to more general situations
which cover these complications. Again, estimation of the structural
parameters can be based on the concentrated likelihood function. If there are
restrictions for the reduced form parameters $\mathbf{A}$, for example, if a
subset model is considered, one may even use the EGLS estimator instead of the
ML estimator for these parameters in estimating the structural parameters.
Clearly, in that case, the white noise covariance estimator $\tilde\Sigma_u$
will not be the exact ML estimator and the exact concentrated log-likelihood
is obtained only if ML estimators are substituted for the reduced form
parameters $\mathbf{A}$. Asymptotically, the corresponding estimators
$\tilde A$, $\tilde B$ based on the EGLS estimators will have the same
properties as the exact ML estimators, however. Even in small samples, exact
ML estimation may not result in substantial gains (see, e.g., Brüggemann
(2004)).
Estimation with Long-Run Restrictions à la Blanchard-Quah

If the total impact matrix $\Xi_\infty$ is restricted to be triangular as in
Blanchard & Quah (1989) and Galí (1999), estimation becomes particularly easy.
Specifying $A = I_K$, using the relation
$\Xi_\infty = (I_K - A_1 - \cdots - A_p)^{-1}B$ and noting that

$$
\Xi_\infty\Xi_\infty' = (I_K - A_1 - \cdots - A_p)^{-1}\,\Sigma_u\,(I_K - A_1' - \cdots - A_p')^{-1},
$$

the matrix B can be estimated by premultiplying a Choleski decomposition
of the matrix

$$
(I_K - \hat A_1 - \cdots - \hat A_p)^{-1}\,\tilde\Sigma_u\,(I_K - \hat A_1' - \cdots - \hat A_p')^{-1}
$$

by $(I_K - \hat A_1 - \cdots - \hat A_p)$.

This latter procedure works only if the VAR operator is stable and the
process is stationary because for integrated processes the inverse of
$(I_K - A_1 - \cdots - A_p)$ does not exist, as explained earlier. On the
other hand, cointegrated variables do not create problems for the other
estimation methods for SVAR models.
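The Choleski-based procedure for triangular long-run restrictions takes only a few lines. The sketch below assumes hypothetical estimates of a stable bivariate VAR(1); the function name is invented for illustration. It uses the lower-triangular Choleski factor, so that the second shock has no long-run effect on the first variable.

```python
import numpy as np

def blanchard_quah_B(A_list, Sigma_u):
    """Estimate B such that (I_K - A_1 - ... - A_p)^{-1} B is lower triangular."""
    K = Sigma_u.shape[0]
    A_one = np.eye(K) - sum(A_list, np.zeros((K, K)))   # I_K - A_1 - ... - A_p
    A_one_inv = np.linalg.inv(A_one)                    # requires a stable VAR
    long_run_cov = A_one_inv @ Sigma_u @ A_one_inv.T
    Xi_inf = np.linalg.cholesky(long_run_cov)           # lower-triangular factor
    B = A_one @ Xi_inf
    return B, Xi_inf

# Hypothetical reduced form estimates (illustrative numbers only).
A1_hat = np.array([[0.5, 0.1],
                   [0.2, 0.3]])
Sigma_u_hat = np.array([[1.0, 0.5],
                        [0.5, 2.0]])

B_hat, Xi_inf_hat = blanchard_quah_B([A1_hat], Sigma_u_hat)
```

The resulting matrix reproduces the residual covariance matrix through BB' and yields a total impact matrix with a zero in the upper right-hand corner.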


9.3.2  Estimating  Structural  VECMs 

Suppose the structural restrictions for a VECM are given in the form of
linear restrictions on $\Xi B$ and B, as in (9.2.3). For computing the
parameter estimates, we may replace $\Xi$ by its reduced form ML estimator,

$$
\hat\Xi = \hat\beta_\perp \Big[ \hat\alpha_\perp' \Big( I_K - \sum_{i=1}^{p-1} \hat\Gamma_i \Big) \hat\beta_\perp \Big]^{-1} \hat\alpha_\perp',
$$

where the $\hat\Gamma_i$'s are the ML estimators of the $\Gamma_i$'s from
Proposition 7.3 and $\hat\alpha_\perp$ and $\hat\beta_\perp$ are any
orthogonal complements of the ML estimators $\hat\alpha$ and $\hat\beta$,
respectively. The restricted ML estimator of B can be obtained by setting
$A = I_K$ and optimizing the concentrated log-likelihood function (9.3.3) with
respect to B, subject to the restrictions (9.2.3), with $C_l$ replaced by

$$
\hat C_l = C_{\Xi B}(I_K \otimes \hat\Xi)
$$

(see Vlaar (2004)). Although this procedure results in a set of stochastic
restrictions, from a numerical point of view we have a standard constrained
optimization problem which can be solved by a Lagrange approach (see Appendix
A.14) because $\hat\Xi$ is fixed in computing the estimate of B. Due to the
fact that for a just-identified structural model the log-likelihood maximum is
the same as for the reduced form, a comparison of the log-likelihood values
can serve as a check for a proper convergence of the optimization algorithm
used for structural estimation.
The properties of the ML estimator of B follow in principle from Corollary
9.5.1. In other words, $\tilde B$ is consistent and asymptotically normal
under standard conditions,

$$
\sqrt{T}\,\mathrm{vec}(\tilde B - B) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\tilde B}).
$$

The asymptotic distribution is singular because of the restrictions that have
been imposed on B. Thus, although t-ratios can be used for assessing the
significance of individual parameters, F-tests based on the Wald principle
will in general not be valid and have to be interpreted cautiously.
Expressions for the covariance matrices of the asymptotic distributions in
terms of the model parameters can be obtained in the usual way by working out
the corresponding information matrices (see Vlaar (2004)). For practical
purposes, it is common to use bootstrap methods for inference in this context.

In principle, the same approach can be used if there are over-identifying
restrictions for B. In that case, $\tilde B\tilde B'$ will not be equal to the
reduced form white noise covariance estimator $\tilde\Sigma_u$, however. Still
the estimator of B will be consistent and asymptotically normal under general
conditions and also the LR statistic given in (9.3.6) can be used to check the
validity of the over-identifying restrictions. It will have the usual
asymptotic $\chi^2$-distribution with degrees of freedom equal to the number
of over-identifying restrictions.


9.4  Impulse  Response  Analysis  and  Forecast  Error 
Variance  Decomposition 

Impulse response analysis can now be based on structural innovations. In
other words, the impulse response coefficients are obtained from the matrices

$$
\Theta_j = \Phi_j A^{-1}B, \quad j = 0, 1, 2, \dots.
$$

Using the same reasoning as in Chapter 3, Section 3.7, the corresponding
estimated quantities are asymptotically normal as nonlinear functions of
asymptotically normal parameter estimators,

$$
\sqrt{T}\,\mathrm{vec}(\hat\Theta_j - \Theta_j) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat\Theta_j}).
$$

In practice, bootstrap methods are routinely employed for inference in this
context. However, the same inference problems as in Chapter 3, Section 3.7,
prevail for structural impulse responses. More precisely, the asymptotic
distribution may be singular in which case confidence intervals based on
asymptotic theory or bootstrap methods may not have the desired confidence
level even asymptotically.
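The structural impulse responses can be computed recursively from the levels VAR coefficients. The following sketch assumes the usual recursion for the reduced form MA coefficient matrices (the first one is the identity and each later one is a sum of earlier ones postmultiplied by the VAR coefficient matrices); the function name and all numerical values are invented for illustration.

```python
import numpy as np

def structural_irf(A_list, A, B, horizon):
    """Theta_j = Phi_j A^{-1} B for j = 0, ..., horizon.

    Phi_0 = I_K and Phi_j = sum_{i=1}^{min(j,p)} Phi_{j-i} A_i.
    """
    K = B.shape[0]
    p = len(A_list)
    impact = np.linalg.solve(A, B)                  # A^{-1} B
    Phi = [np.eye(K)]
    for j in range(1, horizon + 1):
        terms = (Phi[j - i] @ A_list[i - 1] for i in range(1, min(j, p) + 1))
        Phi.append(sum(terms, np.zeros((K, K))))
    return np.stack([P @ impact for P in Phi])

# Hypothetical stable VAR(1) with a lower-triangular B and A = I_2.
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
B = np.array([[1.0, 0.0],
              [0.5, 1.3]])

theta = structural_irf([A1], np.eye(2), B, horizon=200)
accumulated = theta.sum(axis=0)     # approximates the total impact matrix
```

For a stable process, summing the responses over a long horizon approximates the total impact matrix of (9.1.15).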

We use a set of quarterly U.S. data for the period 1947.1-1988.4 from
King et al. (1991) for the three variables log private output ($q_t$),
consumption ($c_t$), and investment ($i_t$) (all multiplied by 100) to
illustrate structural impulse responses.¹ The three series are plotted in
Figure 9.1. They all have a trending behavior and there is some evidence that
they are well modelled as I(1) series. Applying LR tests for the cointegrating
rank with a trend orthogonal to the cointegration relations to a model with
one lagged difference of the variables provides evidence for two cointegration
relations, that is, $r = 2$ (see Section 8.2.4 for the description of the
tests). Therefore we proceed from the following estimated reduced form VECM
(t-statistics in parentheses):


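$$
\begin{aligned}
\begin{bmatrix} \Delta q_t \\ \Delta c_t \\ \Delta i_t \end{bmatrix}
 ={}& \begin{bmatrix} \underset{(-0.2)}{-0.88} \\ \underset{(-1.1)}{-2.83} \\ \underset{(-4.1)}{-30.07} \end{bmatrix}
 + \begin{bmatrix}
     \underset{(-3.6)}{-0.23} & \underset{(4.6)}{0.20} \\
     \underset{(-1.5)}{-0.06} & \underset{(2.4)}{0.07} \\
     \underset{(-0.9)}{-0.11} & \underset{(2.9)}{0.26}
   \end{bmatrix}
   \begin{bmatrix}
     1 & 0 & \underset{(-27.7)}{-1.02} \\
     0 & 1 & \underset{(-24.2)}{-1.10}
   \end{bmatrix}
   \begin{bmatrix} q_{t-1} \\ c_{t-1} \\ i_{t-1} \end{bmatrix} \\
 &+ \begin{bmatrix}
     \underset{(1.2)}{0.12} & \underset{(0.7)}{0.09} & \underset{(3.4)}{0.16} \\
     \underset{(3.2)}{0.21} & \underset{(-2.3)}{-0.21} & \underset{(0.8)}{0.02} \\
     \underset{(3.6)}{0.70} & \underset{(-0.6)}{-0.17} & \underset{(3.6)}{0.33}
   \end{bmatrix}
   \begin{bmatrix} \Delta q_{t-1} \\ \Delta c_{t-1} \\ \Delta i_{t-1} \end{bmatrix}
 + \begin{bmatrix} u_{1t} \\ u_{2t} \\ u_{3t} \end{bmatrix}.
\end{aligned}
$$

¹ The data are available at the website http://www.wws.princeton.edu/~mwatson/.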

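Before we can proceed with structural estimation, we have to specify
identifying restrictions. Using the zero restrictions from (9.2.4), the
following estimates are obtained:

$$
\tilde B = \begin{bmatrix}
  \underset{(0.4)}{0.08} & \underset{(3.9)}{1.03} & \underset{(-0.8)}{-0.45} \\
  \underset{(-0.7)}{-0.60} & \underset{(4.1)}{0.43} & 0 \\
  \underset{(0.6)}{0.26} & \underset{(5.1)}{1.96} & \underset{(1.9)}{1.00}
\end{bmatrix}
\quad \text{and} \quad
\hat\Xi\tilde B = \begin{bmatrix}
  \underset{(-0.8)}{-0.71} & 0 & 0 \\
  \underset{(-0.8)}{-0.76} & 0 & 0 \\
  \underset{(-0.8)}{-0.69} & 0 & 0
\end{bmatrix}. \tag{9.4.2}
$$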


Here bootstrapped t-statistics based on 2000 bootstrap replications are given
in parentheses. In other words, the standard deviations of the estimates are
obtained with a bootstrap (see Appendix D.3) and then the estimated
coefficients are divided by their respective bootstrap standard deviations to
get the t-ratios. Clearly, some of the t-ratios are quite small. Thus, it may
be possible to impose over-identifying restrictions. In fact, because all
t-ratios of the nonzero long-run effects are small, it may be tempting to
argue that no significant permanent effect is found. Recall, however, that,
based on the unit root and cointegration analysis, there cannot be more shocks
with transitory effects. We have used the just-identified model for an impulse
response analysis to shed more light on this issue.

There are three structural innovations, one of which must have permanent
effects if the cointegration rank is 2. In Figure 9.2, the responses of all
three variables to the shock with potentially permanent effects are depicted.
The 95% confidence intervals are based on 2000 replications. Considering the
confidence intervals determined with Hall's percentile method (see Appendix
D.3), it turns out that none of the confidence intervals associated with
longer term responses contains zero. Hence, a significant long-run effect may
actually be present for each of the three variables. If, however, the standard
percentile bootstrap confidence intervals are used for the impulse responses,
the situation is quite different. These confidence intervals are also shown in
Figure 9.2 and they all include zero for longer term horizons. Thus, the
results are not very robust with respect to the methods used. Clearly, the
confidence intervals are quite asymmetric around the point estimates. In such
a situation the Hall percentile confidence intervals may be more reliable due
to their built-in bias correction.

Fig. 9.2. Responses of output, consumption, and investment (top to bottom) to
a permanent shock with Hall percentile (left) and standard percentile (right)
95% bootstrap confidence intervals based on 2000 bootstrap replications.
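The difference between the two types of bootstrap intervals can be made concrete with a small computation. The sketch below assumes the common definitions: the standard percentile interval uses the quantiles of the bootstrap estimates directly, whereas the Hall percentile interval uses the quantiles of the deviations of the bootstrap estimates from the point estimate. The point estimate and the artificial bootstrap draws are invented and do not reproduce the results in the figure.

```python
import numpy as np

def percentile_intervals(theta_hat, theta_boot, level=0.95):
    """Standard and Hall percentile intervals from bootstrap estimates."""
    lo, hi = np.quantile(theta_boot, [(1 - level) / 2, (1 + level) / 2])
    standard = (lo, hi)
    # Hall: theta_hat minus the upper/lower quantiles of (theta_boot - theta_hat).
    hall = (2 * theta_hat - hi, 2 * theta_hat - lo)
    return standard, hall

# Artificial example: a negative point estimate and bootstrap estimates that
# are shifted towards zero relative to it.
theta_hat = -0.7
theta_boot = np.linspace(-0.9, 0.1, 2001)

standard, hall = percentile_intervals(theta_hat, theta_boot)
```

In this artificial case the standard percentile interval covers zero while the Hall percentile interval lies entirely below zero, which mirrors the kind of disagreement described above when the bootstrap distribution is not centered at the point estimate.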

The  estimated  responses  to  the  permanent  shock  are  all  negative  in  the 
long-run.  To  see  the  effects  of  an  impulse  which  leads  to  positive  long-run 
effects,  we  can  just  reverse  the  signs  of  the  responses.  This  follows  from  the 
unidentified  signs  of  the  columns  of  B  discussed  in  Sections  9.1.2  and  9.2. 
Generally,  the  effects  of  positive  and  negative  shocks  of  the  same  size  are 
identical  in  absolute  value  because  our  model  is  a  linear  one  which  does  not 
permit  asymmetric  reactions  to  positive  and  negative  shocks. 
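The sign indeterminacy is easily verified: reversing the sign of a column of B leaves BB' unchanged and reverses the sign of the responses to the corresponding shock. A minimal sketch with an invented matrix B:

```python
import numpy as np

# Hypothetical impact matrix (illustrative numbers only).
B = np.array([[1.0, 0.0],
              [0.5, 1.3]])

# Reverse the sign of the first column, that is, of the first shock.
sign_change = np.diag([-1.0, 1.0])
B_flipped = B @ sign_change

same_covariance = np.allclose(B @ B.T, B_flipped @ B_flipped.T)
```

Both matrices imply the same reduced form covariance matrix, so the data cannot distinguish between them and the sign has to be fixed by a normalization.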

In Figure 9.3, the responses of the variables to the two transitory shocks
are shown. All impulse responses approach zero quickly after some periods
and the effects of the shocks after 20 periods are practically negligible.

Fig. 9.3. Responses of output, consumption, and investment (top to bottom) to
transitory shocks with 95% Hall percentile bootstrap confidence intervals
based on 2000 bootstrap replications (identification restriction (9.4.2)).

The identifying restriction on the B matrix is clearly seen in the right-hand
panel in the middle row of Figure 9.3. Here the instantaneous effect of the
second transitory shock on $c_t$ is zero. If a zero restriction is imposed
instead on the upper right-hand corner element of B, the estimated matrix
becomes


$$
\begin{bmatrix}
0.08 & 1.12 & 0 \\
(0.4) & (5.7) & \\
-0.60 & 0.39 & 0.17 \\
(-0.7) & (2.9) & (1.4) \\
0.26 & 1.39 & 1.70 \\
(0.6) & (4.5) & (11.2)
\end{bmatrix}
\tag{9.4.3}
$$

and the corresponding structural impulse responses are depicted in Figure 9.4.
Obviously, the identification restriction determines to some extent the shape
of the impulse responses. At least the responses to the second transitory shock
are quite different from those based on the identification restriction (9.4.2).
Now, of course, $q_t$ reacts only with a delay to the second transitory shock.
The first column of B in (9.4.3) is unchanged relative to (9.4.2) and, more
generally, the responses to the permanent shock (not shown) are unaffected
because that shock is identified without additional restrictions.

Fig. 9.4. Responses of output, consumption, and investment (top to bottom) to
transitory  shocks  with  95%  Hall  percentile  bootstrap  confidence  intervals  based  on 
2000  bootstrap  replications  (identification  restriction  (9.4.3)). 


Forecast  error  variance  decompositions  can  also  be  based  on  the  structural 
innovations. The computations are based on the $\Theta_j$ as in Section 2.3.3. The
interpretation  may  be  different,  however.  It  may  not  be  possible  to  associate 
the  structural  innovations  uniquely  with  the  variables  of  the  system.  There¬ 
fore,  the  forecast  errors  are  not  decomposed  into  contributions  of  the  different 
variables  but  into  the  contributions  of  the  structural  innovations.  For  instance, 
for  the  example  system  with  identifying  restriction  on  B  as  in  (9.4.2),  a  fore¬ 
cast  error  variance  decomposition  is  shown  in  Figure  9.5.  Now  we  can  see  that 
the permanent shocks (the first components of the $\varepsilon_t$'s) have a growing
importance with increasing forecast horizon, where the estimation uncertainty
is  ignored,  however.  In  turn,  the  importance  of  the  transitory  shocks  (shocks 
number  2  and  3)  declines  for  all  three  variables.  Actually,  the  third  shock  (the 
second  transitory  shock)  does  not  contribute  much  to  the  forecast  errors  of 
any of the three variables.

Fig. 9.5. Forecast error variance decomposition of the output, consumption, and
investment  system  based  on  identification  scheme  (9.4.2)  with  relative  contributions 
of  the  permanent  shock  (1)  and  the  two  transitory  shocks  (2  and  3). 
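
A decomposition of this kind can be sketched in a few lines. The sketch assumes that the structural impulse responses are available as an array `theta` with `theta[i][j, k]` giving the response of variable j to structural shock k after i periods, and that the structural innovations are uncorrelated with unit variances. The VAR(1) coefficient matrix `A1` and the impact matrix `B` are hypothetical numbers, not the estimates of the example system.

```python
import numpy as np

def fevd(theta, h):
    """Forecast error variance decomposition for forecast horizon h.

    theta : array of shape (n, K, K) with structural impulse responses,
            theta[i][j, k] = response of variable j to shock k after i periods
    Returns a (K, K) array whose (j, k) element is the share of the h-step
    forecast error variance of variable j accounted for by shock k.
    """
    contributions = (theta[:h] ** 2).sum(axis=0)
    return contributions / contributions.sum(axis=1, keepdims=True)

# Hypothetical stable VAR(1) with impact matrix B, so that Theta_i = A1^i B.
A1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])
B = np.array([[1.0, 0.0],
              [0.5, 2.0]])
theta = np.array([np.linalg.matrix_power(A1, i) @ B for i in range(20)])

omega_1 = fevd(theta, 1)
omega_20 = fevd(theta, 20)
print(omega_1)
print(omega_20)
```

Each row of the result sums to one, so the entries can be read as relative contributions of the structural innovations. As noted in the text, estimation uncertainty is ignored in such a calculation.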


9.5  Further  Issues 

Structural  VARs  and  VECMs  have  not  only  found  widespread  use  in  applied 
work  but  there  are  also  numerous  further  methodological  contributions.  For 
example,  confidence  bands  for  impulse  responses  are  sometimes  constructed 
with  Bayesian  methods  (e.g.,  Koop  (1992)).  In  fact,  the  practice  of  reporting 
confidence  intervals  around  individual  impulse  response  coefficients  was  ques¬ 
tioned  by  Sims  &  Zha  (1999).  They  proposed  likelihood-characterizing  error 
bands  as  alternatives. 

Also  other  forms  of  identifying  restrictions  were  considered  by  some  au¬ 
thors.  For  example,  Uhlig  (1994)  proposed  to  use  inequality  constraints  for 
the  impulse  responses  for  identifying  them.  In  contrast,  Lee,  Pesaran  &  Pierse 
(1992)  and  Pesaran  &  Shin  (1996)  considered  persistence  profiles  which  mea¬ 
sure  the  persistence  of  certain  shocks  without  imposing  structural  identifica¬ 
tion  restrictions. 

It  may  be  worth  remembering,  however,  that  structural  impulse  responses 
are  not  immune  to  some  of  the  problems  discussed  in  Chapter  2  in  the  context 
of  impulse  response  analysis.  In  particular,  omitted  variables,  filtering  and 
adjusting series prior to using them for a VAR analysis and using aggregated
or transformed data can lead to major changes in the dynamic behavior of
the  model.  For  instance,  if  an  important  variable  is  omitted  from  a  system  of 
interest,  adding  it  can  change  in  principle  all  the  impulse  responses.  Similarly, 
using  seasonally  adjusted  and,  hence,  filtered  data  can  change  the  dynamic 
structure  of  the  variables  and,  thus,  may  lead  to  impulse  responses  which  are 
quite  different  from  those  for  unadjusted  variables.  These  problems  are  not 
solved  by  imposing  identifying  restrictions  and  are  worth  keeping  in  mind  also 
in  a  structural  VAR  analysis. 


9.6  Exercises 


9.6.1  Algebraic  Problems 


Problem  9.1 

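Show that for a three-dimensional VECM with cointegration rank r = 2, the
set of restrictions

$$
\Xi B = \begin{bmatrix} 0 & 0 & 0 \\ * & 0 & 0 \\ * & 0 & 0 \end{bmatrix}
$$

is not sufficient for identification. Moreover, show that the restrictions

$$
\Xi B = \begin{bmatrix} * & 0 & 0 \\ * & 0 & 0 \\ * & 0 & 0 \end{bmatrix}
\quad \text{and} \quad
B = \begin{bmatrix} 0 & * & * \\ * & * & * \\ * & * & * \end{bmatrix}
$$

do not identify B locally.
(Hint: Choose

$$
B = \begin{bmatrix} b_{11} & 0 \\ 0 & B_2 \end{bmatrix},
$$

where $B_2$ is a (2 x 2) matrix. Show that $B_2$ is not unique.)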

Problem  9.2 

Suppose a four-dimensional process $y_t$ can be written in VECM form (9.2.1)
with cointegrating rank 2. Impose just-identifying restrictions on $B$ and $\Xi B$.


Problem  9.3 

Define $C = B^{-1}A$ and write the concentrated log-likelihood (9.3.3) as

$$
\ln l_c(C) = \text{constant} + T \ln |C| - \frac{T}{2}\,\mathrm{tr}(C'C\tilde{\Sigma}_u).
$$

Use the rules for matrix differentiation from Appendix A.13 to show that

$$
\frac{\partial \ln l_c}{\partial C} = T C'^{-1} - T C \tilde{\Sigma}_u.
$$

Next show that

$$
\frac{\partial\, \mathrm{vec}(B^{-1}A)}{\partial\, \mathrm{vec}(A)'} = I_K \otimes B^{-1}
$$

and

$$
\frac{\partial\, \mathrm{vec}(B^{-1}A)}{\partial\, \mathrm{vec}(B)'} = -(A'B'^{-1} \otimes B^{-1}).
$$

Use these results to derive an explicit expression for the score vector

$$
\frac{\partial \ln l}{\partial (\gamma_A', \gamma_B')'}.
$$

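Problem 9.4

Define $\alpha := [\mathrm{vec}(A)', \mathrm{vec}(B)']'$ and $\gamma := (\gamma_A', \gamma_B')'$ and show that, for the setup
in Proposition 9.5,

$$
-E\left(\frac{\partial^2 \ln l}{\partial\gamma\,\partial\gamma'}\right)
= \begin{bmatrix} R_A' & 0 \\ 0 & R_B' \end{bmatrix}
\left[-E\left(\frac{\partial^2 \ln l}{\partial\alpha\,\partial\alpha'}\right)\right]
\begin{bmatrix} R_A & 0 \\ 0 & R_B \end{bmatrix}.
$$

Moreover, show that (9.3.4) holds by proving that

$$
E\left(\frac{\partial^2 \ln l}{\partial\alpha\,\partial\alpha'}\right)
= \frac{\partial\,\mathrm{vec}(\Sigma_u)'}{\partial\alpha}\,
E\left(\frac{\partial^2 \ln l}{\partial\,\mathrm{vec}(\Sigma_u)\,\partial\,\mathrm{vec}(\Sigma_u)'}\right)
\frac{\partial\,\mathrm{vec}(\Sigma_u)}{\partial\alpha'}
$$

and, for C such that $CC' = \Sigma_u$,

$$
\frac{\partial\,\mathrm{vec}(\Sigma_u)}{\partial\alpha'}
= \frac{\partial\,\mathrm{vec}(CC')}{\partial\,\mathrm{vec}(C)'}\,
\frac{\partial\,\mathrm{vec}(C)}{\partial\alpha'}
= (I_{K^2} + K_{KK})(C \otimes I_K)\,\frac{\partial\,\mathrm{vec}(C)}{\partial\alpha'}
$$

(see also Chapter 3 for related derivations).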


9.6.2  Numerical  Problems 

Problem  9.5 

Specify, estimate, and analyze a model for U.S. quarterly log output ($q_t$) and
the unemployment rate ($u_t$) for the period 1948.2-1987.4 as given in the
Journal of Applied Econometrics data archive at
http://www.econ.queensu.ca/jae/

(see  the  data  for  Weber  (1995)).  Blanchard  &  Quah  (1989)  considered  this 
system  in  their  study. 

(a)  Analyze  the  integration  and  cointegration  properties  of  the  data. 

(b)  Fit  a  suitable  VAR  model  to  the  bivariate  series. 

(c)  Check  the  adequacy  of  the  model. 

(d) Impose an identifying restriction on the long-run total impact matrix and
perform a structural impulse response analysis.

(e) Compare standard and Hall percentile confidence intervals for the impulse
responses and interpret possible differences.

(f)  Perform  a  forecast  error  variance  decomposition  and  comment  on  the  re¬ 
sults. 

(Hint:  See  Breitung  et  al.  (2004)  for  a  similar  analysis.) 

Problem  9.6 

Analyze the Canadian labor market data from Breitung et al. (2004) (see
www.jmulti.de → datasets
for the data). The variables are:

$p_t$ - ln productivity,
$e_t$ - ln employment,
$ur_t$ - unemployment rate,
$w_t$ - ln real wage index.

Thus, $y_t = (p_t, e_t, ur_t, w_t)'$ is four-dimensional. The data are quarterly for the
period 1980.1-2000.4. They are constructed as described in Breitung et al.
(2004) based on data from the OECD database. Note that Breitung et al.
(2004) use a slightly different notation for the variables.

(a)  Analyze  the  integration  and  cointegration  properties  of  the  data. 

(b)  Fit  a  VECM  with  cointegration  rank  r  =  1  for  yt. 

(c)  Check  the  adequacy  of  your  model. 

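(d) Impose identifying restrictions of the form

$$
B = \begin{bmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \\ * & 0 & * & * \end{bmatrix}
\quad \text{and} \quad
\Xi B = \begin{bmatrix} * & 0 & 0 & 0 \\ * & * & * & 0 \\ * & * & * & 0 \\ * & * & * & 0 \end{bmatrix}
$$

and perform a structural impulse response analysis.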

(e)  Compare  standard  and  Hall  percentile  confidence  intervals  for  the  impulse 
responses  and  interpret  possible  differences. 

(f)  Impose  another  zero  restriction  on  B  and  repeat  the  structural  impulse 
response  analysis. 

(g)  Perform  forecast  error  variance  decompositions  based  on  the  structural 
innovations  for  different  identification  schemes  and  comment  on  the  re¬ 
sults. 

(Hint: See Breitung et al. (2004) for a detailed analysis of the system.)


Systems of Dynamic Simultaneous Equations


10.1  Background 

This  chapter  serves  to  point  out  some  possible  extensions  of  the  models  con¬ 
sidered  so  far  and  to  draw  attention  to  potential  problems  related  to  such 
extensions.  So  far,  we  have  assumed  that  all  stochastic  variables  of  a  system 
have  essentially  the  same  status  in  that  they  are  all  determined  within  the 
system.  In  other  words,  the  model  describes  the  joint  generation  process  of  all 
the  observable  variables  of  interest.  In  practice,  the  generation  process  may  be 
affected  by  other  observable  variables  which  are  determined  outside  the  sys¬ 
tem  of  interest.  Such  variables  are  called  exogenous  or  unmodelled  variables. 
In  contrast,  the  variables  determined  within  the  system  are  called  endogenous. 
Although  deterministic  terms  can  be  included  in  the  set  of  unmodelled  vari¬ 
ables,  we  often  have  stochastic  variables  in  mind  in  this  category.  For  instance, 
weather  related  variables  such  as  rainfall  or  hours  of  sunshine  are  usually  re¬ 
garded  as  stochastic  exogenous  variables.  As  another  example  of  the  latter 
type  of  variables,  if  a  small  open  economy  is  being  studied,  the  price  level  or 
the  output  of  the  rest  of  the  world  may  be  regarded  as  exogenous.  A  model 
which  specifies  the  generation  process  of  some  variables  conditionally  on  some 
other  unmodelled  variables  is  sometimes  called  a  conditional  or  partial  model 
because  it  describes  the  generation  process  of  a  subset  of  the  variables  only. 

A  model  with  unmodelled  variables  may  have  the  structural  form 

$$
A y_t = A_1^* y_{t-1} + \cdots + A_p^* y_{t-p} + B_0^* x_t + B_1^* x_{t-1} + \cdots + B_s^* x_{t-s} + w_t, \tag{10.1.1}
$$

where $y_t = (y_{1t}, \dots, y_{Kt})'$ is a K-dimensional vector of endogenous variables,
$x_t = (x_{1t}, \dots, x_{Mt})'$ is an M-dimensional vector of unmodelled variables, A is
(K x K) and represents the instantaneous relations between the endogenous
variables, the $A_i^*$'s and $B_j^*$'s are (K x K) and (K x M) coefficient matrices,
respectively, and $w_t$ is a K-dimensional error vector. The vector $x_t$ may
contain both stochastic and non-stochastic components. For example, it may
include intercept terms, seasonal dummies, and the amount of rainfall in a
specific region. If the error term $w_t$ is white noise, a model of the type (10.1.1) is
sometimes called a VARX(p, s) model in the following. More generally, models
of  the  form  (10.1.1)  are  often  called  linear  systems  because  they  are  obviously 
linear  in  all  variables.  In  the  econometrics  literature,  the  label  (linear)  dy¬ 
namic  simultaneous  equations  model  (SEM)  is  used  for  such  a  model.  Because 
we  often  have  systems  of  economic  variables  in  mind  in  the  following  discus¬ 
sion,  we  will  use  this  name  occasionally.  We  will  also  consider  a  vector  error 
correction  version  of  the  model  which  is  useful  when  cointegrated  variables 
are  involved. 

Other  names  that  are  occasionally  found  in  the  related  literature  are  trans¬ 
fer  function  models  or  distributed  lag  models.  These  terms  will  become  more 
plausible  in  the  next  section,  where  different  representations  and  some  prop¬ 
erties  of  our  basic  model  (10.1.1)  will  be  discussed.  Estimation  is  briefly  con¬ 
sidered  in  Section  10.3  and  some  remarks  on  model  specification  and  model 
checking  follow  in  Section  10.4.  Possible  uses  of  such  models,  namely  fore¬ 
casting,  multiplier  analysis,  and  control,  are  treated  in  Sections  10.5-10.7. 
Concluding  remarks  are  contained  in  Section  10.8.  It  is  not  the  purpose  of 
this  chapter  to  give  a  detailed  and  complete  account  of  all  these  topics.  The 
chapter  is  just  meant  to  give  some  guidance  to  possible  extensions  of  the  by 
now  familiar  VAR  models  and  VECMs,  the  related  problems  and  some  further 
reading. 


10.2  Systems  with  Unmodelled  Variables 

10.2.1  Types  of  Variables 

In the dynamic simultaneous equations model (10.1.1), we have partitioned
the observables in two groups, $y_t$ and $x_t$. The components of $y_t$ are endogenous
variables and the components of $x_t$ are the unmodelled or exogenous
variables. Although we have given some explanation of the differences between
the two groups of variables, we have not given a precise definition of
the terms endogenous and exogenous so far. The idea is that the endogenous
variables are determined within the system, whereas the unmodelled,
exogenous variables are those on which we can condition the analysis without
affecting the results of interest. Because there are different possible objectives
of an analysis, there are also different notions of exogeneity. For example, if we
are interested in estimating a particular parameter vector $\gamma$, say, $x_t$ is exogenous
if the estimation properties do not suffer from conditioning on $x_t$ rather
than using a full model for the data generation process of all the variables
involved. In that case, $x_t$ is called weakly exogenous for $\gamma$. This and other
types of exogeneity have been formalized by Engle, Hendry & Richard (1983).
They call $x_t$ strongly exogenous if we can condition on this set of variables for
forecasting purposes without losing forecast precision and they classify $x_t$ as
super-exogenous if policy analysis can be made conditional on these variables
(see also Geweke (1982), Hendry (1995, Chapter 5), Ericsson (1994) for more
discussion of exogeneity).

A simple technical definition is to call $x_t$ exogenous if $x_t, x_{t-1}, \dots, x_{t-s}$
are independent of the error term $w_t$. Moreover, $x_t$ is sometimes called strictly
exogenous if all its leads and lags are independent of all leads and lags of the
error process $w_t$, that is, if $x_t$ and $w_t$ are independent processes. Such
assumptions simplify derivations of properties of estimators and are therefore
convenient. They may be unnecessarily restrictive, however, for some
purposes. In the following, we will implicitly make the assumption that $x_t$ and
$w_t$ are independent processes for convenience, although most results can be
obtained under less restrictive conditions.

For  much  of  the  present  discussion,  a  formal  definition  of  the  types  of 
variables  involved  is  not  necessary.  It  suffices  to  have  a  partitioning  into  two 
groups  of  variables.  The  reader  should,  however,  have  some  intuition  of  which 
variables are contained in $y_t$ and which ones are included in $x_t$. As mentioned
previously,  roughly  speaking,  yt  contains  the  observable  outputs  of  the  system, 
that  is,  the  observable  variables  that  are  determined  by  the  system.  In  con¬ 
trast,  the  xt  variables  may  be  regarded  as  observable  input  variables  which  are 
determined  outside  the  system.  In  this  setting,  the  error  variables  wt  may  be 
viewed  as  unobservable  inputs  to  the  system.  As  we  have  seen,  nonstochastic 
components  may  be  absorbed  into  the  set  of  xt  variables.  All  or  some  of  the 
components of $x_t$ may be under full or partial control of the government or a
decision  or  policy  maker.  In  a  control  context,  such  variables  are  often  referred 
to  as  instruments  or  instrument  variables  (see  Section  10.7).  Sometimes  the 
lagged  endogenous  variables  together  with  the  exogenous  variables  of  a  system 
are  called  predetermined  variables.  If  xt  contains  just  a  constant  and  s  =  0, 
the  model  (10.1.1)  reduces  to  a  VAR  model,  provided  wt  is  white  noise. 

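For illustrative purposes, consider the following example system relating
investment ($x_{1t}$), income ($y_{1t}$), and consumption ($y_{2t}$) variables:

$$
\begin{aligned}
y_{1t} &= \nu_1^* + \alpha^*_{11,1} y_{1,t-1} + \alpha^*_{12,1} y_{2,t-1} + \beta^*_{12,1} x_{1,t-1} + w_{1t}, \\
y_{2t} &= \nu_2^* + \alpha^*_{22,1} y_{2,t-1} + \alpha^*_{21,0} y_{1t} + \alpha^*_{21,1} y_{1,t-1} + w_{2t}.
\end{aligned}
\tag{10.2.1}
$$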


This  model  is  similar  to  those  obtained  for  West  German  data  in  Chapter  5. 
An  important  difference  is  that  current  income  appears  in  the  consumption 
equation  and  there  is  no  equation  for  investment.  Thus,  only  income  and 
consumption  are  determined  within  the  system  whereas  investment  is  not. 
The  fact  that  investment  is,  of  course,  determined  within  the  economic  system 
as  a  whole  does  not  necessarily  mean  that  we  have  to  specify  its  generation 
mechanism  if  our  main  interest  is  with  the  generation  mechanism  of  income 
and  consumption.  In  terms  of  the  representation  (10.1.1),  the  example  system 
can  be  written  as 


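$$
\begin{bmatrix} 1 & 0 \\ -\alpha^*_{21,0} & 1 \end{bmatrix}
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \begin{bmatrix} \alpha^*_{11,1} & \alpha^*_{12,1} \\ \alpha^*_{21,1} & \alpha^*_{22,1} \end{bmatrix}
\begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix}
+ \begin{bmatrix} \nu_1^* & \beta^*_{12,1} \\ \nu_2^* & 0 \end{bmatrix}
\begin{bmatrix} 1 \\ x_{1,t-1} \end{bmatrix}
+ \begin{bmatrix} w_{1t} \\ w_{2t} \end{bmatrix}.
\tag{10.2.2}
$$

Thus, $y_t = (y_{1t}, y_{2t})'$ and $x_t = (1, x_{1t})'$ are both two-dimensional. The
predetermined variables are $y_{t-1}$ and $x_{t-1}$.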

In  dynamic  SEMs  there  are  sometimes  identities  or  exact  relations  between 
some  variables.  For  instance,  the  same  figures  may  be  used  for  supply  and 
demand  of  a  product.  In  that  case,  an  identity  equating  supply  and  demand 
may  appear  as  a  separate  equation  of  a  system.  So  far  we  have  not  excluded 
this possibility. However, in later sections the covariance matrix of $w_t$ will be
assumed  to  be  nonsingular  which  excludes  identities.  Then  we  assume  without 
further  notice  that  they  have  been  eliminated  by  substitution.  For  instance, 
the  demand  variable  may  be  substituted  for  the  supply  variable  in  all  instances 
where  it  appears  in  the  system. 


10.2.2  Structural  Form,  Reduced  Form,  Final  Form 

The  representation  (10.1.1)  is  called  the  structural  form  of  the  model  if  it 
represents  the  instantaneous  effects  of  the  endogenous  variables  properly.  The 
instantaneous  effects  are  reflected  in  the  elements  of  A.  The  idea  is  that  the 
instantaneous  causal  links  are  derived  from  theoretical  considerations  and  are 
used  to  place  restrictions  on  A.  Of  course,  multiplication  of  (10.1.1)  with  any 
other  nonsingular  ( K  x  K)  matrix  results  in  an  equivalent  representation  of 
the  process  generating  yt.  Such  a  representation  is  not  called  a  structural 
form,  however,  unless  it  reflects  the  actual  relations  of  interest. 

The reduced form of the system is obtained by premultiplying (10.1.1) with
$A^{-1}$ which gives

$$
y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + B_0 x_t + \cdots + B_s x_{t-s} + u_t, \tag{10.2.3}
$$

where $A_i := A^{-1}A_i^*$ (i = 1, ..., p), $B_j := A^{-1}B_j^*$ (j = 0, 1, ..., s), and $u_t :=
A^{-1}w_t$. We always assume without notice that the inverse of A exists. In
Sections  10.5-10.7,  we  will  see  that  the  reduced  form  is  useful  for  forecasting, 
multiplier  analysis,  and  control  purposes. 

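For the example model given in (10.2.2), we have

$$
A^{-1} = \begin{bmatrix} 1 & 0 \\ \alpha^*_{21,0} & 1 \end{bmatrix}
$$

and, hence, the reduced form is

$$
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= A_1 \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix}
+ B_1 \begin{bmatrix} 1 \\ x_{1,t-1} \end{bmatrix}
+ \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix},
\tag{10.2.4}
$$

where

$$
A_1 = \begin{bmatrix} \alpha_{11,1} & \alpha_{12,1} \\ \alpha_{21,1} & \alpha_{22,1} \end{bmatrix}
= \begin{bmatrix} \alpha^*_{11,1} & \alpha^*_{12,1} \\
\alpha^*_{21,0}\alpha^*_{11,1} + \alpha^*_{21,1} & \alpha^*_{21,0}\alpha^*_{12,1} + \alpha^*_{22,1} \end{bmatrix},
\tag{10.2.5}
$$

$$
B_1 = \begin{bmatrix} \beta_{11,1} & \beta_{12,1} \\ \beta_{21,1} & \beta_{22,1} \end{bmatrix}
= \begin{bmatrix} \nu_1^* & \beta^*_{12,1} \\
\alpha^*_{21,0}\nu_1^* + \nu_2^* & \alpha^*_{21,0}\beta^*_{12,1} \end{bmatrix}
$$

and

$$
\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}
= \begin{bmatrix} w_{1t} \\ \alpha^*_{21,0} w_{1t} + w_{2t} \end{bmatrix}.
\tag{10.2.6}
$$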

It  is  important  to  note  that  the  reduced  form  parameters  are  in  general  non¬ 
linear  functions  of  the  structural  form  parameters. 
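
The mapping from the structural to the reduced form can be checked numerically. The following is a minimal sketch for the two-equation example system. All numerical values are hypothetical and serve only to show that the products of structural parameters in (10.2.5) and (10.2.6) arise from premultiplying by the inverse of A.

```python
import numpy as np

def reduced_form(A, A_star, B_star):
    """Premultiply the structural coefficient matrices by A^{-1}."""
    A_inv = np.linalg.inv(A)
    A_red = [A_inv @ M for M in A_star]
    B_red = [A_inv @ M for M in B_star]
    return A_red, B_red

# Hypothetical structural parameters of the example system.
a210 = 0.4                                    # instantaneous effect of income on consumption
A = np.array([[1.0, 0.0],
              [-a210, 1.0]])
A1_star = np.array([[0.5, 0.1],
                    [0.2, 0.3]])
B0_star = np.zeros((2, 2))                    # x_t does not enter contemporaneously
B1_star = np.array([[1.0, 0.6],               # first column: intercepts
                    [2.0, 0.0]])              # second column: lagged investment

A_red, B_red = reduced_form(A, [A1_star], [B0_star, B1_star])
print(A_red[0])   # second row contains a210 * a11 + a21 and a210 * a12 + a22
print(B_red[1])   # second row contains a210 * nu1 + nu2 and a210 * b12
```

The second rows of the reduced form matrices are sums of products of structural parameters, which is the nonlinearity referred to above.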

In lag operator notation, the reduced form (10.2.3) can be written as

$$
A(L)y_t = B(L)x_t + u_t, \tag{10.2.7}
$$

where

$$
A(L) := I_K - A_1 L - \cdots - A_p L^p
$$

and

$$
B(L) := B_0 + B_1 L + \cdots + B_s L^s.
$$

If the effect of a change in an exogenous variable on the endogenous variables is
of interest, it is useful to solve the system (10.2.7) for the endogenous variables
by multiplying with $A(L)^{-1}$. The resulting representation,

$$
y_t = D(L)x_t + A(L)^{-1}u_t, \tag{10.2.8}
$$

where $D(L) := A(L)^{-1}B(L)$, is sometimes called the final form of the system.
Of course, using $A(L)^{-1}$ requires invertibility of $A(L)$ which is guaranteed if

$$
\det A(z) \neq 0 \quad \text{for } |z| \le 1. \tag{10.2.9}
$$
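
In practice, condition (10.2.9) is usually checked through the eigenvalues of the companion matrix of the VAR operator: the determinant has no roots in or on the unit circle exactly when all eigenvalues of the companion matrix have modulus less than one. The sketch below assumes the reduced form matrices are given as a list of NumPy arrays. The function name `is_stable` and the numbers are hypothetical.

```python
import numpy as np

def is_stable(A_list):
    """Check det A(z) != 0 for |z| <= 1 via the companion matrix.

    A_list : list [A_1, ..., A_p] of (K x K) reduced form coefficient matrices
    """
    K = A_list[0].shape[0]
    p = len(A_list)
    companion = np.zeros((K * p, K * p))
    companion[:K, :] = np.hstack(A_list)           # first block row: A_1, ..., A_p
    companion[K:, :-K] = np.eye(K * (p - 1))       # identity blocks below
    eigenvalues = np.linalg.eigvals(companion)
    return bool(np.max(np.abs(eigenvalues)) < 1.0)

A1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])
stable_var1 = is_stable([A1])                      # eigenvalues 0.7 and 0.3

A1_unit = np.array([[1.0, 0.0],
                    [0.0, 0.5]])
unit_root_var1 = is_stable([A1_unit])              # one eigenvalue equal to 1

explosive_var2 = is_stable([0.5 * np.eye(2), 0.6 * np.eye(2)])
print(stable_var1, unit_root_var1, explosive_var2)
```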


If  yt  contains  just  one  variable,  A(L)  is  a  scalar  operator  and  the  form  (10.2.8) 
is  often  called  a  distributed  lag  model  in  the  econometrics  literature  because  it 
describes  how  lagged  effects  of  changes  in  xt  are  distributed  over  time.  Because 
the  lag  distribution  for  each  exogenous  variable  can  be  written  as  a  ratio  of 
two finite order polynomials in the lag operator ($A(L)^{-1}B(L)$), the model is
referred  to  as  a  rational  distributed  lag  model.  In  the  time  series  literature,  the 
label  rational  transfer  function  model  is  often  attached  to  (10.2.8)  in  both  the 
scalar  and  the  vector  case.  The  operator  D(L)  represents  the  transfer  function 
transferring  the  observable  inputs  into  the  outputs  of  the  system. 

For the example model with reduced form (10.2.4), we get a final form

$$
\begin{aligned}
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
&= (I_2 - A_1 L)^{-1} B_1 L \begin{bmatrix} 1 \\ x_{1t} \end{bmatrix}
+ (I_2 - A_1 L)^{-1} \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} \\
&= \sum_{i=1}^{\infty} A_1^{i-1} B_1 L^i \begin{bmatrix} 1 \\ x_{1t} \end{bmatrix}
+ \sum_{i=0}^{\infty} A_1^{i} L^i \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}.
\end{aligned}
\tag{10.2.10}
$$

Note that $B_0 = 0$ and thus, $D_0 = 0$ and $D_i = A_1^{i-1}B_1$ for i = 1, 2, ....

The coefficient matrices $D_i = (d_{kj,i})$ of the transfer function operator

$$
D(L) = \sum_{i=0}^{\infty} D_i L^i
$$

contain the effects that changes in the exogenous variables have on the
endogenous variables. Everything else held constant, a unit change in the j-th
exogenous variable in period t induces a marginal change of $d_{kj,i}$ units in the
k-th endogenous variable in period t + i. The elements of the $D_i$ matrices
are therefore called dynamic multipliers. The accumulated effects contained
in $\sum_{i=0}^{n} D_i$ are the n-th interim multipliers and the elements of $\sum_{i=0}^{\infty} D_i$ are
the long-run effects or total multipliers. We will return to multiplier analysis
in Section 10.6.
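
The multipliers can be computed recursively by comparing coefficients in $A(L)D(L) = B(L)$, which gives $D_i = B_i + \sum_{j=1}^{\min(i,p)} A_j D_{i-j}$ with $B_i = 0$ for i > s. The following sketch applies this recursion to a system of the same form as the example, where $B_0 = 0$. The numerical values and the function name `dynamic_multipliers` are hypothetical.

```python
import numpy as np

def dynamic_multipliers(A_list, B_list, n):
    """Return [D_0, ..., D_n] from A(L) D(L) = B(L).

    A_list : [A_1, ..., A_p], each (K x K)
    B_list : [B_0, ..., B_s], each (K x M)
    """
    K, M = B_list[0].shape
    D = []
    for i in range(n + 1):
        D_i = B_list[i].copy() if i < len(B_list) else np.zeros((K, M))
        for j, A_j in enumerate(A_list, start=1):
            if j <= i:
                D_i = D_i + A_j @ D[i - j]
        D.append(D_i)
    return D

A1 = np.array([[0.5, 0.1],
               [0.4, 0.34]])
B0 = np.zeros((2, 2))
B1 = np.array([[1.0, 0.6],
               [2.4, 0.24]])

D = dynamic_multipliers([A1], [B0, B1], n=200)
interim_4 = sum(D[:5])                                  # 4th interim multipliers
total = np.linalg.inv(np.eye(2) - A1) @ (B0 + B1)       # A(1)^{-1} B(1)
print(D[3])
print(interim_4)
print(total)
```

For a stable system, the accumulated multipliers converge to $A(1)^{-1}B(1)$, so the total multipliers do not require summing the infinite series.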

As in the example, the transfer function operator $D(L)$ has infinite order
in general. A finite order representation of the system is obtained by noting
that $A(L)^{-1} = A(L)^{adj}/|A(L)|$, where $A(L)^{adj}$ denotes, as usual, the adjoint
of $A(L)$. Thus, multiplying the reduced form by $A(L)^{adj}$ gives

$$
|A(L)|\,y_t = A(L)^{adj}B(L)x_t + A(L)^{adj}u_t \tag{10.2.11}
$$

which involves finite order operators only. In the econometrics literature these
equations are sometimes called final equations. Because $|A(L)|$ is a scalar
operator, each equation contains only one of the endogenous variables.

Assuming that the unmodelled variables $x_t$ are driven by a VAR(q) process,
say

$$
x_t = C_1 x_{t-1} + \cdots + C_q x_{t-q} + v_t,
$$

where $q \le p$ and $v_t$ is white noise, then the joint generation process of $y_t$ and
$x_t$ is

$$
\begin{bmatrix} I_K & -B_0 \\ 0 & I_M \end{bmatrix}
\begin{bmatrix} y_t \\ x_t \end{bmatrix}
= \begin{bmatrix} A_1 & B_1 \\ 0 & C_1 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+ \cdots +
\begin{bmatrix} A_p & B_p \\ 0 & C_p \end{bmatrix}
\begin{bmatrix} y_{t-p} \\ x_{t-p} \end{bmatrix}
+ \begin{bmatrix} u_t \\ v_t \end{bmatrix},
$$

where it is assumed without loss of generality that $s, q \le p$, $B_i := 0$ for $i > s$
and $C_j := 0$ for $j > q$. If $u_t$ is also white noise, premultiplying by

$$
\begin{bmatrix} I_K & -B_0 \\ 0 & I_M \end{bmatrix}^{-1}
$$

shows that the joint generation process of $y_t$ and $x_t$ is a VAR(p).


10.2.3 Models with Rational Expectations

Sometimes  the  endogenous  variables  are  assumed  to  depend  not  only  on  other 
endogenous  and  exogenous  variables  but  also  on  expectations  on  endogenous 
variables.  If  only  expectations  formed  in  the  previous  period  for  the  present 
period  are  of  importance,  one  could  simply  add  another  term  involving  the  ex¬ 
pectations  variables  to  the  structural  form  (10.1.1).  Denoting  the  expectations 
variables by $y_t^e$ may then result in a reduced form

$$
y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + F y_t^e + B_0 x_t + \cdots + B_s x_{t-s} + u_t \tag{10.2.12}
$$

or

$$
A(L)y_t = F y_t^e + B(L)x_t + u_t, \tag{10.2.13}
$$

where F is a (K x K) matrix of parameters and $A(L)$ and $B(L)$ are the matrix
polynomials in the lag operator from (10.2.7).

Following Muth (1961), the expectations $y_t^e$ formed in period t - 1 are
called rational if they are the best possible predictions, given the information
in period t - 1. In other words, $y_t^e$ is the conditional expectation $E_{t-1}(y_t)$, given
all information available in period t - 1. In forming the predictions or
expectations, not only the past values of the endogenous and unmodelled variables are
assumed to be known but also the model (10.2.12) and the generation process
of the unmodelled variables. It is easy to see that, if the unmodelled variables
are generated by a VAR process, the expectations variables can be eliminated
from (10.2.12)/(10.2.13). The resulting reduced form is of VARX type. To
show this result, suppose that $u_t$ is independent white noise and, as before,
denote by $E_t$ the conditional expectation, given all information available in
period t. Applying $E_{t-1}$ to (10.2.12) then gives

$$
\begin{aligned}
y_t^e &= E_{t-1}(y_t) \\
&= A_1 y_{t-1} + \cdots + A_p y_{t-p} + F y_t^e + B_0 E_{t-1}(x_t) + B_1 x_{t-1} + \cdots + B_s x_{t-s}
\end{aligned}
\tag{10.2.14}
$$

or, because $A(L) = I_K - A_1 L - \cdots - A_p L^p$,

$$
y_t^e = (I_K - A(L))y_t + F y_t^e + B_0 E_{t-1}(x_t) + (B(L) - B_0)x_t. \tag{10.2.15}
$$

Assuming that $I_K - F$ is invertible, this system can be solved for $y_t^e$:

$$
y_t^e = (I_K - F)^{-1}\left[(I_K - A(L))y_t + B_0 E_{t-1}(x_t) + (B(L) - B_0)x_t\right]. \tag{10.2.16}
$$

If $x_t$ is generated by a VAR(q) process, say

$$
x_t = C_1 x_{t-1} + \cdots + C_q x_{t-q} + v_t,
$$

where $v_t$ is independent white noise, then

$$
E_{t-1}(x_t) = C_1 x_{t-1} + \cdots + C_q x_{t-q}.
$$

Substituting this expression in (10.2.16) shows that $y_t^e$ depends on lagged
$y_t$ and $x_t$ only. Thus, substituting for $y_t^e$ in (10.2.12) or (10.2.13), we get a
standard VARX form of the model.
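
The substitution can be carried out explicitly. Writing $G := (I_K - F)^{-1}$ and using $I_K + FG = G$, the VARX form without expectations term has coefficient matrices $G A_i$ on $y_{t-i}$, $B_0$ on $x_t$ and $G B_j + F G B_0 C_j$ on $x_{t-j}$ for j ≥ 1 (with $B_j = 0$ for j > s and $C_j = 0$ for j > q). The sketch below implements this mapping and verifies it for one set of hypothetical parameter values. The function name and all numbers are assumptions made for illustration.

```python
import numpy as np

def eliminate_expectations(A_list, B_list, F, C_list):
    """Return the VARX coefficients obtained by substituting out y_t^e."""
    K = F.shape[0]
    B0 = B_list[0]
    M = B0.shape[1]
    G = np.linalg.inv(np.eye(K) - F)               # (I_K - F)^{-1}
    A_new = [G @ A_i for A_i in A_list]
    B_new = [B0.copy()]                            # coefficient of x_t is unchanged
    for j in range(1, max(len(B_list) - 1, len(C_list)) + 1):
        B_j = B_list[j] if j < len(B_list) else np.zeros((K, M))
        C_j = C_list[j - 1] if j <= len(C_list) else np.zeros((M, M))
        B_new.append(G @ B_j + F @ G @ B0 @ C_j)
    return A_new, B_new

# Hypothetical system with K = 2, M = 1, p = s = q = 1.
A1 = np.array([[0.5, 0.1], [0.4, 0.3]])
F = np.array([[0.2, 0.0], [0.1, 0.3]])
B0 = np.array([[0.5], [0.0]])
B1 = np.array([[0.1], [0.4]])
C1 = np.array([[0.7]])

A_new, B_new = eliminate_expectations([A1], [B0, B1], F, [C1])

# Check against a direct evaluation of (10.2.16) and (10.2.12) without u_t.
y_lag, x_lag, x_now = np.array([1.0, -2.0]), np.array([0.5]), np.array([1.5])
G = np.linalg.inv(np.eye(2) - F)
y_exp = G @ (A1 @ y_lag + B0 @ (C1 @ x_lag) + B1 @ x_lag)
direct = A1 @ y_lag + F @ y_exp + B0 @ x_now + B1 @ x_lag
via_varx = A_new[0] @ y_lag + B_new[0] @ x_now + B_new[1] @ x_lag
print(direct, via_varx)
```

The products of F, $B_0$ and $C_j$ in the new coefficients are an instance of the nonlinear restrictions that the rational expectations assumption implies for the reduced form without expectations terms.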

Thus,  in  theory,  when  the  true  coefficient  matrices  are  known,  we  can 
simply  eliminate  the  term  involving  expectations  variables  and  work  with 
a  standard  reduced  form  without  an  expectations  term.  It  should  be  clear, 
however, that substituting the right-hand side of (10.2.16) for $y_t^e$ in (10.2.12)
implies  nonlinear  restrictions  on  the  coefficient  matrices  of  the  reduced  form 
without  expectations  terms.  Taking  into  account  such  restrictions  may  in¬ 
crease  the  efficiency  of  parameter  estimators.  The  same  is  true,  of  course, 
for  the  structural  form.  Therefore,  it  is  important  in  practice  whether  or  not 
the  actual  relationship  between  the  variables  is  partly  determined  by  agents’ 
expectations. 

For  expository  purposes  we  have  just  treated  a  very  special  case  where  only 
expectations  formed  in  period  t  —  1  for  period  t  enter  the  model.  Extensions 
can  be  treated  in  a  similar  way.  For  instance,  past  expectations  for  more  than 
one  period  ahead  or  expectations  formed  in  various  previous  periods  may  be 
of importance. If $x_t$ is generated by a VAR(q) process, they can be eliminated
like  in  the  special  case  considered  in  the  foregoing. 

A  complication  of  the  basic  model  that  makes  life  a  bit  more  difficult  is 
the  inclusion  of  future  expectations.  It  is  quite  realistic  to  suppose  that,  for 
instance,  the  expected  future  price  of  a  commodity  may  determine  the  supply 
in  the  present  period.  For  example,  if  bond  prices  are  expected  to  fall  during 
the  next  period,  an  investor  may  decide  to  sell  now.  If  future  expectations 
enter  the  model,  the  solution  for  the  endogenous  variables  will  in  general  not 
be  unique.  In  other  words,  the  process  that  generates  the  endogenous  variables 
may  not  be  uniquely  determined  by  the  model,  even  if  the  generation  process 
of  the  exogenous  variables  is  uniquely  specified.  Further  extensive  discussions 
of  rational  expectations  models  can  be  found  in  volumes  by  Lucas  &  Sargent 
(1981)  and  Pesaran  (1987). 

10.2.4  Cointegrated  Variables 

Many of the results discussed so far in this section hold for systems of stationary
or integrated variables. More precisely, whenever the VAR operator $A(L)$
is  not  required  to  be  invertible,  integrated  variables  may  be  present  as  en¬ 
dogenous  as  well  as  unmodelled  variables.  If  there  are  cointegrated  variables, 
it  may  be  preferable,  however,  to  separate  the  short-  and  long-run  dynam¬ 
ics  as  in  a  VECM.  Assuming  that  there  are  r  cointegration  relations  among 
the  endogenous  variables  and  they  are  not  cointegrated  with  the  unmodelled 
variables,  the  corresponding  form  of  the  model  is 

$$
\mathsf{A}\Delta y_t = \alpha^*\beta' y_{t-1} + \Gamma_1^*\Delta y_{t-1} + \cdots + \Gamma_{p-1}^*\Delta y_{t-p+1}
 + B_0^* x_t + B_1^* x_{t-1} + \cdots + B_s^* x_{t-s} + w_t, \tag{10.2.17}
$$

where $\mathsf{A}$ is a $(K \times K)$ matrix of instantaneous effects, as before, $\alpha^*$ is a $(K \times r)$
matrix of structural loading coefficients, $\beta$ is the $(K \times r)$ cointegration matrix,
$\Gamma_j^*$ ($j = 1, \dots, p-1$) is a $(K \times K)$ matrix of structural short-run coefficients,
and all other symbols are defined as in (10.1.1). In many respects, this model
can be dealt with in essentially the same way as the VECMs considered in
Part II of this volume.

It  is  also  possible,  however,  that  there  is  cointegration  between  endogenous 
and  unmodelled  variables.  In  that  case,  a  suitable  form  of  the  model  is 


$$
\mathsf{A}\Delta y_t = \alpha^*\beta^{+\prime}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
 + \Gamma_1^*\Delta y_{t-1} + \cdots + \Gamma_{p-1}^*\Delta y_{t-p+1}
 + \Upsilon_0^*\Delta x_t + \Upsilon_1^*\Delta x_{t-1} + \cdots + \Upsilon_{s-1}^*\Delta x_{t-s+1} + w_t, \tag{10.2.18}
$$

where now the unmodelled variables appear in levels form in the error correction
term only and otherwise enter in differenced form with suitable coefficient
matrices $\Upsilon_j^*$ ($j = 0, 1, \dots, s-1$). It is easy to see that such a model form can
be obtained if the joint generation process of $y_t$ and $x_t$ has a (reduced form)
VECM representation


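$$
\begin{bmatrix} \Delta y_t \\ \Delta x_t \end{bmatrix}
=
\begin{bmatrix} \alpha \\ \alpha_x \end{bmatrix}
\beta^{+\prime}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+
\begin{bmatrix} \Gamma_1 & \Upsilon_1 \\ 0 & \Gamma_1^x \end{bmatrix}
\begin{bmatrix} \Delta y_{t-1} \\ \Delta x_{t-1} \end{bmatrix}
+ \cdots +
\begin{bmatrix} \Gamma_{p-1} & \Upsilon_{p-1} \\ 0 & \Gamma_{p-1}^x \end{bmatrix}
\begin{bmatrix} \Delta y_{t-p+1} \\ \Delta x_{t-p+1} \end{bmatrix}
+
\begin{bmatrix} u_t \\ v_t \end{bmatrix}, \tag{10.2.19}
$$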


where $p \geq s$ is assumed without loss of generality and all symbols have obvious
definitions. Premultiplying this model form with

$$
\begin{bmatrix} \mathsf{A} & -\Upsilon_0^* \\ 0 & I_M \end{bmatrix}
$$

gives a model where the first $K$ equations are just the structural form
(10.2.18). Notice, however, that the $y_t$ may enter the $x_t$ equations in (10.2.19)
via the cointegration relations if $\alpha_x \neq 0$. It turns out that $x_t$ is weakly
exogenous for $\beta^+$ if $\alpha_x = 0$. Thus, if the cointegration relations are of primary
interest, considering the partial model for $\Delta y_t$ is justified if $\alpha_x = 0$.
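
The premultiplication step can be spelled out as a short derivation. The sketch below assumes that the lower-right coefficient blocks of (10.2.19) are written as $\Gamma_j^x$ (a label introduced here for illustration only) and that lagged $\Delta y_t$ do not enter the $x_t$ equations. Taking the first $K$ rows of the premultiplied system gives

$$
\mathsf{A}\Delta y_t - \Upsilon_0^*\Delta x_t
= (\mathsf{A}\alpha - \Upsilon_0^*\alpha_x)\,\beta^{+\prime}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+ \sum_{j=1}^{p-1}\left[\mathsf{A}\Gamma_j\,\Delta y_{t-j} + (\mathsf{A}\Upsilon_j - \Upsilon_0^*\Gamma_j^x)\,\Delta x_{t-j}\right]
+ \mathsf{A}u_t - \Upsilon_0^* v_t .
$$

Moving $\Upsilon_0^*\Delta x_t$ to the right-hand side and matching terms with (10.2.18) suggests $\alpha^* = \mathsf{A}\alpha - \Upsilon_0^*\alpha_x$, $\Gamma_j^* = \mathsf{A}\Gamma_j$, $\Upsilon_j^* = \mathsf{A}\Upsilon_j - \Upsilon_0^*\Gamma_j^x$ (which must vanish for $j \geq s$) and $w_t = \mathsf{A}u_t - \Upsilon_0^* v_t$.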

Both models (10.2.17) and (10.2.18) can be rewritten in levels form. The
result is then a structural form as in (10.1.1). Moreover, the structural forms
can be converted into reduced form by premultiplying with $\mathsf{A}^{-1}$.


10.3  Estimation 

Parameter estimation in the presence of unmodelled variables will be discussed
separately for stationary and cointegrated variables. We begin with
the stationary case.

10.3.1  Stationary  Variables

Suppose $(y_t', x_t')'$ is generated by a stationary process and we wish to estimate
the parameters of the reduced form (10.2.3), which can be written as

$$
y_t = A Y_{t-1} + B X_{t-1} + B_0 x_t + u_t, \tag{10.3.1}
$$

where $A := [A_1, \dots, A_p]$, $B := [B_1, \dots, B_s]$,

$$
Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix},
\qquad
X_t := \begin{bmatrix} x_t \\ \vdots \\ x_{t-s+1} \end{bmatrix}.
$$

Here $u_t$ is assumed to be standard white noise with nonsingular covariance
matrix $\Sigma_u$. Moreover, we allow for parameter restrictions and assume that a
matrix $R$ and a vector $\gamma$ exist such that

$$
\beta := \operatorname{vec}[A, B, B_0] = R\gamma. \tag{10.3.2}
$$

With these assumptions, estimation of $\beta$ and, hence, of $A$, $B$, and $B_0$ is
straightforward.

For  a  sample  of  size  T,  the  system  can  be  written  compactly  as 


$$
Y = [A, B, B_0] Z + U, \tag{10.3.3}
$$

where

$$
Y := [y_1, \dots, y_T],
\qquad
Z := \begin{bmatrix} Y_0, \dots, Y_{T-1} \\ X_0, \dots, X_{T-1} \\ x_1, \dots, x_T \end{bmatrix}
$$

and $U := [u_1, \dots, u_T]$.


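Vectorizing gives

$$
\mathbf{y} = (Z' \otimes I_K) R\gamma + \mathbf{u},
$$

where $\mathbf{y} := \operatorname{vec}(Y)$ and $\mathbf{u} := \operatorname{vec}(U)$. From Chapter 5, the GLS estimator is
known to be

$$
\hat{\gamma} = [R'(ZZ' \otimes \Sigma_u^{-1})R]^{-1} R'(Z \otimes \Sigma_u^{-1})\mathbf{y}. \tag{10.3.4}
$$

This estimator is not operational because in practice $\Sigma_u$ is unknown. However,
as in Section 5.2.2, $\Sigma_u$ may be estimated from the LS estimator

$$
\check{\gamma} = [R'(ZZ' \otimes I_K)R]^{-1} R'(Z \otimes I_K)\mathbf{y},
$$

which gives residuals $\check{\mathbf{u}} = \mathbf{y} - (Z' \otimes I_K)R\check{\gamma}$ and an estimator

$$
\check{\Sigma}_u = \check{U}\check{U}'/T \tag{10.3.5}
$$

of $\Sigma_u$, where $\check{U}$ is such that $\operatorname{vec}(\check{U}) = \check{\mathbf{u}}$. Using this estimator of the white
noise covariance matrix results in the EGLS estimator

$$
\hat{\hat{\gamma}} = [R'(ZZ' \otimes \check{\Sigma}_u^{-1})R]^{-1} R'(Z \otimes \check{\Sigma}_u^{-1})\mathbf{y}. \tag{10.3.6}
$$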

Under standard assumptions, this estimator is consistent and asymptotically
normal,

$$
\sqrt{T}(\hat{\hat{\gamma}} - \gamma) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\hat{\gamma}}}), \tag{10.3.7}
$$

where

$$
\Sigma_{\hat{\hat{\gamma}}} = \left(R'[\operatorname{plim}(T^{-1}ZZ') \otimes \Sigma_u^{-1}]R\right)^{-1}. \tag{10.3.8}
$$

One condition for this result to hold is, of course, that both $\operatorname{plim} T^{-1}ZZ'$ and
the inverse of the matrix in (10.3.8) exist. Further assumptions are required
to guarantee the asymptotic normal distribution of the EGLS estimator. The
assumptions may include the following ones: (i) $u_t$ is standard white noise,
(ii) the VAR part is stable, that is,

$$
|A(z)| = |I_K - A_1 z - \cdots - A_p z^p| \neq 0 \quad \text{for } |z| \leq 1,
$$

and (iii) $x_t$ is generated by a stationary, stable VAR process which is independent
of the white noise process $u_t$. A precise statement of more general
conditions and a proof are given, e.g., by Hannan & Deistler (1988). The latter
part of our set of assumptions requires that all the exogenous variables are
stochastic. It can be modified so as to include nonstochastic variables as well.
In that case, the plim in (10.3.8) reduces to a nonstochastic limit in some or
all components (see, e.g., Anderson (1971, Chapter 5), Harvey (1981)).
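
Condition (ii) can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper name `is_stable`; it relies on the standard equivalence between the determinantal condition and all eigenvalues of the VAR companion matrix having modulus less than one.

```python
import numpy as np

def is_stable(A_list):
    """Return True if det(I_K - A_1 z - ... - A_p z^p) != 0 for all |z| <= 1.

    A_list holds the (K x K) matrices A_1, ..., A_p. Stability is checked via
    the eigenvalues of the (Kp x Kp) companion matrix of the VAR part.
    """
    K = A_list[0].shape[0]
    p = len(A_list)
    companion = np.zeros((K * p, K * p))
    companion[:K, :] = np.hstack(A_list)           # first block row: A_1 ... A_p
    companion[K:, :-K] = np.eye(K * (p - 1))       # identity blocks below
    moduli = np.abs(np.linalg.eigvals(companion))
    return bool(moduli.max() < 1.0)
```

For a univariate VAR(2) with coefficients 0.5 and 0.3, for example, `is_stable([np.array([[0.5]]), np.array([[0.3]])])` evaluates the two roots of the companion matrix and reports stability.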

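An estimator for $\beta = R\gamma$ is obtained as $\hat{\hat{\beta}} = R\hat{\hat{\gamma}}$. If (10.3.7) holds, this
estimator also has an asymptotic normal distribution,

$$
\sqrt{T}(\hat{\hat{\beta}} - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\hat{\beta}}} = R\Sigma_{\hat{\hat{\gamma}}}R'). \tag{10.3.9}
$$

Moreover, under general conditions, the corresponding estimator $\hat{\hat{\Sigma}}_u$ of the
white noise covariance matrix is asymptotically independent of $\hat{\hat{\beta}}$ and has the
same asymptotic distribution as the estimator $UU'/T$ based on the unobserved
true residuals. For instance, for a Gaussian process,

$$
\sqrt{T}\operatorname{vech}(\hat{\hat{\Sigma}}_u - \Sigma_u) \xrightarrow{d} \mathcal{N}\left(0, 2D_K^{+}(\Sigma_u \otimes \Sigma_u)D_K^{+\prime}\right), \tag{10.3.10}
$$

where $D_K^{+} = (D_K'D_K)^{-1}D_K'$ is the Moore-Penrose inverse of the $(K^2 \times \tfrac{1}{2}K(K+1))$
duplication matrix $D_K$.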
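
The two-step logic of (10.3.4) to (10.3.6) can be sketched in a few lines. This is an illustrative implementation only, assuming NumPy; the function name `egls`, the simulated VARX design, and the default `R = I` (no restrictions) are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def egls(Y, Z, R=None):
    """Two-step EGLS for Y = [A, B, B0] Z + U with beta = R gamma.

    Y is (K x T), Z is (n x T). R defaults to the identity (unrestricted).
    Returns the EGLS estimate of gamma and the LS-based estimate of Sigma_u.
    """
    K, T = Y.shape
    n = Z.shape[0]
    if R is None:
        R = np.eye(K * n)
    y = Y.reshape(-1, order="F")                   # y = vec(Y), column-major
    ZZ = Z @ Z.T

    def gls(S_inv):
        # [R'(ZZ' kron S^{-1})R]^{-1} R'(Z kron S^{-1}) y
        lhs = R.T @ np.kron(ZZ, S_inv) @ R
        rhs = R.T @ (np.kron(Z, S_inv) @ y)
        return np.linalg.solve(lhs, rhs)

    gamma_ls = gls(np.eye(K))                      # first step: LS
    coef_ls = (R @ gamma_ls).reshape(K, n, order="F")
    U_ls = Y - coef_ls @ Z                         # LS residuals
    Sigma_u = U_ls @ U_ls.T / T                    # as in (10.3.5)
    gamma = gls(np.linalg.inv(Sigma_u))            # second step, as in (10.3.6)
    return gamma, Sigma_u

# Simulated example with p = 1, s = 0: y_t = A1 y_{t-1} + B0 x_t + u_t
rng = np.random.default_rng(0)
K, M, T = 2, 1, 200
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
B0 = np.array([[1.0], [-0.5]])
x = rng.standard_normal((M, T + 1))
y = np.zeros((K, T + 1))
for t in range(1, T + 1):
    y[:, t] = A1 @ y[:, t - 1] + B0 @ x[:, t] + rng.standard_normal(K)

Y = y[:, 1:]
Z = np.vstack([y[:, :-1], x[:, 1:]])               # rows: y_{t-1}, then x_t
gamma_hat, Sigma_u_hat = egls(Y, Z)
coef_hat = gamma_hat.reshape(K, K + M, order="F")  # estimate of [A1, B0]
```

Without restrictions the EGLS estimator coincides with multivariate LS, which provides a convenient check; a nontrivial `R` (for instance, columns of the identity matrix selecting the unrestricted coefficients) is where the second step makes a difference.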

In discussing direct reduced form estimation with white noise errors, we
have treated a particularly simple case. The following complications are
possible.

(1) Usually there will be restrictions on the structural form coefficients $\mathsf{A}$,
$A_i^*$, $i = 1, \dots, p$, and $B_j^*$, $j = 0, \dots, s$. Such restrictions may imply nonlinear
constraints on the reduced form coefficients which are not covered by
the above approach. Rational expectations assumptions may be another
source of nonlinear restrictions on the reduced form parameters. Theoretically,
it is not difficult to handle nonlinear restrictions on the reduced form
parameters. In practice, numerical problems may arise in a multivariate
LS or GLS estimation with nonlinear restrictions.

(2) Interest may focus on the structural rather than the reduced form.
Estimation of the structural form has been discussed extensively in the
econometrics literature. For surveys and many further references
see Judge et al. (1985), Hausman (1983), or textbooks such as Hayashi
(2000). A major complication in estimating the structural form of a SEM
such as (10.1.1) results from its possible nonuniqueness. Note that we
have not assumed a triangular $\mathsf{A}$ matrix or a diagonal covariance matrix
of $w_t$. Premultiplication of (10.1.1) by any nonsingular matrix results in
an equivalent representation of the process. Thus, for proper estimation
there must be restrictions on the structural form coefficients that guarantee
uniqueness or identification of the structural form coefficients.

(3) So far we have just discussed models which are linear in the variables.
In practice, there may be nonlinear relations between the variables.
Estimation of nonlinear dynamic models where the endogenous as well as
the unmodelled conditioning variables may enter in a nonlinear way is,
for instance, discussed by Bierens (1981), Gallant (1987), and Gallant &
White (1988).

In the next section, we will consider models with integrated and cointegrated
variables.


10.3.2  Estimation  of  Models  with  I(1)  Variables


If there are integrated and cointegrated variables in the model and a reduced
form VECM corresponding to the structural form (10.2.18),

$$
\Delta y_t = \alpha\beta^{+\prime}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
 + \Gamma_1\Delta y_{t-1} + \cdots + \Gamma_{p-1}\Delta y_{t-p+1}
 + \Upsilon_0\Delta x_t + \Upsilon_1\Delta x_{t-1} + \cdots + \Upsilon_{s-1}\Delta x_{t-s+1} + u_t, \tag{10.3.11}
$$

is set up, estimation can in principle proceed as in Section 7.2. Assuming that
a sample of size $T$ and all required presample values are available and defining


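$$
\Delta Y := [\Delta y_1, \dots, \Delta y_T],
$$

$$
Y^{+}_{-1} := [y^{+}_0, \dots, y^{+}_{T-1}]
\quad \text{with} \quad
y^{+}_{t-1} := \begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix},
$$

$$
\Delta X^{+} := [\Delta X^{+}_0, \dots, \Delta X^{+}_{T-1}]
\quad \text{with} \quad
\Delta X^{+}_{t-1} :=
\begin{bmatrix}
\Delta y_{t-1} \\ \vdots \\ \Delta y_{t-p+1} \\ \Delta x_t \\ \Delta x_{t-1} \\ \vdots \\ \Delta x_{t-s+1}
\end{bmatrix},
$$

and

$$
U := [u_1, \dots, u_T],
$$

we get

$$
\Delta Y = \alpha\beta^{+\prime}Y^{+}_{-1} + \Gamma^{+}\Delta X^{+} + U, \tag{10.3.12}
$$

where

$$
\Gamma^{+} := [\Gamma_1 : \cdots : \Gamma_{p-1} : \Upsilon_0 : \Upsilon_1 : \cdots : \Upsilon_{s-1}].
$$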

Thus, we have precisely the same model form as in Section 7.2 (see, e.g.,
(7.2.3)) and, in principle, all the estimators of that section are available.
Notice, however, that now $\beta^{+}$ is a $((K + M) \times r)$ matrix whereas $\alpha$ is still
$(K \times r)$. Because the error correction term now involves all the cointegration
relations between the endogenous and unmodelled variables, it is possible that
$r > K$. In that case, it is easy to see that most of the estimators of Section
7.2 are not available. Thus, we have to assume that $r \leq K$. In fact, if $r = K$,
the matrix $\Pi^{+} := \alpha\beta^{+\prime}$ is of full row rank under our usual assumption that
$\operatorname{rk}(\alpha) = \operatorname{rk}(\beta^{+}) = r$. Therefore, if $K = r$, we do not even need reduced
rank regression but can simply estimate the matrix $\Pi^{+} = \alpha\beta^{+\prime}$ by applying
multivariate LS to (10.3.12). An estimator of $\beta^{+}$ can then be obtained by
normalizing the cointegration matrix as in Section 7.2 such that


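$$
\beta^{+} = \begin{bmatrix} I_K \\ \beta^{+}_{(M)} \end{bmatrix}, \tag{10.3.13}
$$

where the lower block $\beta^{+}_{(M)}$ is $(M \times K)$, and using

$$
\hat{\beta}^{+\prime} = (\hat{\Pi}^{+}_{(K)})^{-1}\hat{\Pi}^{+},
$$

where $\hat{\Pi}^{+}_{(K)}$ is the $(K \times K)$ submatrix consisting of the first $K$ columns of the
LS estimator $\hat{\Pi}^{+}$ of $\Pi^{+}$.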
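
For the full row rank case $r = K$, the LS-plus-normalization recipe can be sketched as follows. This is a hedged illustration assuming NumPy; the function name `beta_plus_full_rank` and the simulated bivariate design (one endogenous variable, one random-walk conditioning variable, no lagged differences) are made up for the example.

```python
import numpy as np

def beta_plus_full_rank(dY, Ylag_plus, dX_plus=None):
    """LS estimate of Pi^+ in (10.3.12) and normalized beta^{+'} when r = K.

    dY:        (K x T) matrix of Delta y_t
    Ylag_plus: ((K+M) x T) matrix of lagged levels [y_{t-1}; x_{t-1}]
    dX_plus:   optional matrix of differenced regressors (Delta X^+)
    """
    K = dY.shape[0]
    W = Ylag_plus if dX_plus is None else np.vstack([Ylag_plus, dX_plus])
    coef = np.linalg.lstsq(W.T, dY.T, rcond=None)[0].T   # multivariate LS
    Pi_hat = coef[:, :Ylag_plus.shape[0]]                # estimate of alpha beta^{+'}
    beta_t = np.linalg.solve(Pi_hat[:, :K], Pi_hat)      # (Pi_(K))^{-1} Pi
    return Pi_hat, beta_t

# Simulated example: Delta y_t = -0.5 (y_{t-1} - 2 x_{t-1}) + u_t, x_t a random walk
rng = np.random.default_rng(1)
T = 500
x = np.cumsum(rng.standard_normal(T + 1))
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = y[t - 1] - 0.5 * (y[t - 1] - 2.0 * x[t - 1]) + rng.standard_normal()

dY = np.diff(y)[None, :]
Ylag = np.vstack([y[:-1], x[:-1]])
Pi_hat, beta_t = beta_plus_full_rank(dY, Ylag)   # beta_t estimates [1, -2]
```

By construction the first $K$ columns of the normalized estimate equal the identity, so only the remaining $M$ columns carry estimated cointegration parameters.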

If $r < K$, there is nothing special here relative to the procedures discussed
in Section 7.2. Reduced rank ML estimation, as discussed in Section 7.2.3,
is available just as the EGLS estimator of the cointegration parameters of
Section 7.2.2 and the two-stage estimator described in Section 7.2.5. Moreover,
the two-stage procedure can also be used to estimate models with parameter
restrictions on $\alpha$ and $\Gamma^{+}$, as in Section 7.3.2. In fact, a similar procedure can
even be used for the estimation of structural form models of the type (10.2.18).

In this context, it is, of course, of interest to know the properties of the
resulting estimators. They are available under suitable assumptions for the
model and the variables (see, e.g., Johansen (1992) or Davidson (2000, Section
16.5)). Under general assumptions, the estimator of the cointegration matrix
continues to be superconsistent, that is,

$$
T(\hat{\beta}^{+} - \beta^{+}) = O_p(1),
$$

if all variables are at most $I(1)$ and $\beta^{+}$ is identified. If the cointegration
relations do not enter the generation process of $x_t$, that is, $\alpha_x = 0$ in (10.2.19),
$x_t$ is weakly exogenous for $\beta^{+}$ and the ML and EGLS estimators of $\beta^{+}$ have
mixed normal distributions similar to those discussed in Section 7.2. Therefore
standard inference is possible, as discussed in that section. The estimators of
the $\alpha$ and $\Gamma^{+}$ parameters have again standard properties which are the same
as in the case where the $\beta^{+}$ matrix is known.


10.4  Remarks  on  Model  Specification  and  Model 
Checking 

The basic principles of model specification and checking the model adequacy
have been discussed in some detail in previous chapters. We will therefore
make just a few remarks here. With respect to the specification there is,
however, a major difference between the models considered previously and the
dynamic SEMs of this chapter. While in a reduced form VAR analysis usually
relatively little prior knowledge from economic or other subject matter theory
is used, such theories may well be the major building block in specifying SEMs.
In that case, model checking becomes of central importance in investigating
the validity of the theory. Quite often, theories are not available that specify
the data generation process completely. For instance, the lag lengths of the
endogenous and/or exogenous variables may have to be specified with statistical
tools. Also, some researchers may not be prepared to rely on the available
theories and therefore prefer to substitute statistical investigations for uncertain
prior knowledge. Statistical specification strategies for general dynamic SEMs
were, for instance, proposed and discussed by Hannan & Kavalieris (1984),
Hannan & Deistler (1988), and Poskitt (1992). These strategies are based on
model selection criteria of the type considered in previous chapters. An
extensive literature exists on the specification of special models. For instance,
distributed lag models are discussed at length in the econometrics literature
(for some references see Judge et al. (1985, Chapters 9 and 10)). Specification
proposals for transfer function models with one dependent variable $y_t$ go back
to the pioneering work of Box & Jenkins (1976). Other suggestions have been
made by Haugh & Box (1977), Young, Jakeman & McMurtrie (1980), Liu &
Hanssens (1982), Tsay (1985), and Poskitt (1989) to name just a few.

If  some  of  the  variables  are  integrated,  one  may  also  want  to  investigate  the 
number  of  cointegration  relations  with  statistical  tests.  From  the  discussion 
in  Section  10.3.2,  it  is  clear  that  rank  tests  can  be  used  for  that  purpose,  as  in 
Section  8.2.  These  tests  may  now  be  based  either  on  a  VECM  for  the  full  joint 
generation process of $y_t$ and $x_t$ or on a partial model with some unmodelled
variables.  The  latter  approach  may  be  preferable  if  a  large  number  of  variables 
is  involved.  Johansen’s  LR  tests  for  the  cointegrating  rank  may  be  unreliable 
in  that  situation  because  of  size  distortions  and  lack  of  power.  Therefore, 
testing  for  the  cointegrating  rank  in  a  partial  model  may  be  advantageous.  The 
asymptotic  distributions  of  the  relevant  LR  test  statistics  in  this  case  depend 
on  the  conditioning  variables,  however.  This  result  is  not  surprising,  of  course, 
because  the  conditioning  variables  can  in  fact  be  deterministic  terms  and  we 
have  seen  in  Section  8.2  that  such  terms  have  an  impact  on  the  asymptotic 
properties  of  the  LR  tests.  The  relevant  tests  for  conditional  models  were 
derived  by  Harbo,  Johansen,  Nielsen  &  Rahbek  (1998)  and  critical  values 
were  given  in  MacKinnon,  Haug  &  Michelis  (1999). 

In  checking  the  model  adequacy  one  may  want  to  test  various  restrictions. 
These  may  range  from  constraints  suggested  by  some  kind  of  theory  such 
as  the  rational  expectations  hypothesis,  to  tests  of  the  significance  of  extra 
lags.  The  three  testing  principles  discussed  previously,  namely  the  LR,  LM, 
and  Wald  principles  (see  Appendix  C.7)  can  be  used  in  the  present  context. 
Their  asymptotic  properties  follow  in  the  usual  way  from  properties  of  the 
estimators  and  the  model. 

A  residual  analysis  is  another  tool  which  is  available  in  the  present  case. 
Plots  of  residuals  may  help  to  identify  unusual  values  or  patterns  that  suggest 
model  deficiencies.  Plots  of  residual  autocorrelations  may  aid  in  checking  the 
white noise assumption. Also a portmanteau test for overall residual autocorrelation
may be developed for dynamic models with exogenous variables; see
Poskitt  &  Tremayne  (1981)  for  a  discussion  of  this  issue  and  further  references. 


10.5  Forecasting 

10.5.1  Unconditional  and  Conditional  Forecasts 

If  the  future  paths  of  the  unmodelled  variables  are  unknown  to  the  forecaster, 
then  forecasts  of  these  variables  are  needed  in  order  to  predict  the  future  values 
of  the  endogenous  variables  on  the  basis  of  a  dynamic  SEM.  For  simplicity, 
suppose  that  the  exogenous  variables  are  generated  by  a  zero  mean  VAR(q) 
process  as  in  Section  10.2.3, 


$$
x_t = C_1 x_{t-1} + \cdots + C_q x_{t-q} + v_t. \tag{10.5.1}
$$

Now this process can be used to produce optimal forecasts $x_t(h)$ of $x_t$ in the
usual way. If the endogenous variables are generated by the reduced form
model (10.2.3) with $u_t$ being independent white noise which is also independent
of the $x_t$ process, the optimal $h$-step forecast of $y_{t+h}$ at origin $t$ is

$$
y_t(h) = A_1 y_t(h-1) + \cdots + A_p y_t(h-p) + B_0 x_t(h) + \cdots + B_s x_t(h-s), \tag{10.5.2}
$$

where $y_t(j) := y_{t+j}$ and $x_t(j) := x_{t+j}$ for $j \leq 0$. This formula can be used for
recursively determining forecasts for $h = 1, 2, \dots$.
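
The recursion in (10.5.2), combined with VAR forecasts of the exogenous variables from (10.5.1), can be sketched as follows. This is a minimal illustration assuming NumPy and known coefficient matrices; the function name `unconditional_forecast` and the argument layout are choices made for the example.

```python
import numpy as np

def unconditional_forecast(y_hist, x_hist, A, B, C, h):
    """Recursive h-step forecasts following (10.5.1) and (10.5.2).

    y_hist, x_hist: arrays with one observation per row, most recent last
                    (at least p rows of y and max(q, s) rows of x).
    A = [A_1, ..., A_p], B = [B_0, ..., B_s], C = [C_1, ..., C_q].
    Returns arrays of forecasts y_t(1..h) and x_t(1..h), one row per horizon.
    """
    y = list(y_hist)
    x = list(x_hist)
    for _ in range(h):
        # x_t(h) = C_1 x_t(h-1) + ... + C_q x_t(h-q)
        x.append(sum(C[j] @ x[-1 - j] for j in range(len(C))))
        # y_t(h) = sum_i A_i y_t(h-i) + sum_j B_j x_t(h-j); x[-1] is now x_t(h)
        y_next = sum(A[i] @ y[-1 - i] for i in range(len(A)))
        y_next = y_next + sum(B[j] @ x[-1 - j] for j in range(len(B)))
        y.append(y_next)
    return np.array(y[-h:]), np.array(x[-h:])

# Scalar example: y_t = 0.5 y_{t-1} + x_t + u_t, x_t = 0.8 x_{t-1} + v_t
A = [np.array([[0.5]])]
B = [np.array([[1.0]])]
C = [np.array([[0.8]])]
y_fc, x_fc = unconditional_forecast(np.array([[1.0]]), np.array([[1.0]]), A, B, C, h=2)
```

In the scalar example the first step gives $x_t(1) = 0.8$ and $y_t(1) = 0.5 + 0.8 = 1.3$; the second step reuses these forecasts in place of the unknown future values.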

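An alternative way of getting these forecasts is obtained by writing the
generation processes of the exogenous variables in one overall model together
with the reduced form SEM:

$$
\begin{bmatrix} I_K & -B_0 \\ 0 & I_M \end{bmatrix}
\begin{bmatrix} y_t \\ x_t \end{bmatrix}
=
\begin{bmatrix} A_1 & B_1 \\ 0 & C_1 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+ \cdots +
\begin{bmatrix} A_p & B_p \\ 0 & C_p \end{bmatrix}
\begin{bmatrix} y_{t-p} \\ x_{t-p} \end{bmatrix}
+
\begin{bmatrix} u_t \\ v_t \end{bmatrix}, \tag{10.5.3}
$$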


where we assume without loss of generality that $p \geq \max(s, q)$ and set $B_i = 0$
for $i > s$ and $C_j = 0$ for $j > q$. As in Section 10.2.2, premultiplying by

$$
\begin{bmatrix} I_K & -B_0 \\ 0 & I_M \end{bmatrix}^{-1}
=
\begin{bmatrix} I_K & B_0 \\ 0 & I_M \end{bmatrix}
$$

gives a standard reduced form VAR($p$) model. It is easy to see that the optimal
forecasts for $y_t$ and $x_t$ from that model are exactly the same as those obtained
by getting forecasts for $x_t$ from (10.5.1) first and using them in the prediction
formula for $y_t$ given in (10.5.2) (see Problem 10.5). Thus, under the present
assumptions, the discussion of forecasting VAR($p$) processes applies. It will
not be repeated here. Also, it is not difficult to extend these ideas to sets of
unmodelled variables with nonstochastic components such as intercept terms
or seasonal dummies.

We will refer to forecasts of $y_t$ obtained in this way as unconditional forecasts
because they are based on forecasts of the exogenous variables for the
forecast period. Occasionally, the forecaster may know some or all of the future
values of the exogenous variables, for instance, because they are under
the control of some decision maker. In that case he or she may be interested
in forecasts of $y_t$ conditional on a specific future path of $x_t$. In order to derive
the optimal conditional forecasts, we write the reduced form (10.2.3) in
VARX(1, 0) form,

$$
\mathbf{Y}_t = \mathbf{A}\mathbf{Y}_{t-1} + \mathbf{B}x_t + \mathbf{U}_t, \tag{10.5.4}
$$

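where

$$
\mathbf{Y}_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ x_t \\ \vdots \\ x_{t-s+1} \end{bmatrix},
\qquad
\mathbf{U}_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix},
$$

$$
\mathbf{A} :=
\begin{bmatrix}
A_1 \cdots A_{p-1} & A_p & B_1 \cdots B_{s-1} & B_s \\
I_{K(p-1)} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & I_{M(s-1)} & 0
\end{bmatrix}
\quad ((Kp+Ms) \times (Kp+Ms)),
$$

with row blocks of $K$, $K(p-1)$, $M$, and $M(s-1)$ rows, and

$$
\mathbf{B} := \begin{bmatrix} B_0 \\ 0 \\ I_M \\ 0 \end{bmatrix}
\quad ((Kp+Ms) \times M),
$$

partitioned conformably. Successive substitution for lagged $\mathbf{Y}_t$'s gives

$$
\mathbf{Y}_t = \mathbf{A}^h\mathbf{Y}_{t-h} + \sum_{i=0}^{h-1}\mathbf{A}^i\mathbf{B}x_{t-i} + \sum_{i=0}^{h-1}\mathbf{A}^i\mathbf{U}_{t-i}. \tag{10.5.5}
$$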


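Hence, premultiplying by the $(K \times (Kp + Ms))$ matrix $J := [I_K : 0 : \cdots : 0]$
results in

$$
y_{t+h} = J\mathbf{A}^h\mathbf{Y}_t + \sum_{i=0}^{h-1} J\mathbf{A}^i\mathbf{B}x_{t+h-i} + \sum_{i=0}^{h-1} J\mathbf{A}^iJ'u_{t+h-i}, \tag{10.5.6}
$$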

where $\mathbf{U}_t = J'J\mathbf{U}_t = J'u_t$ has been used. Now the optimal $h$-step forecast of
$y_t$ at origin $t$, given $x_{t+1}, \dots, x_{t+h}$, and all present and past information, is
easily seen to be

$$
y_t(h|x) := J\mathbf{A}^h\mathbf{Y}_t + \sum_{i=0}^{h-1} J\mathbf{A}^i\mathbf{B}x_{t+h-i} \tag{10.5.7}
$$

and the corresponding forecast error is

$$
y_{t+h} - y_t(h|x) = \sum_{i=0}^{h-1} J\mathbf{A}^iJ'u_{t+h-i}. \tag{10.5.8}
$$

Thus, the MSE of the conditional forecast is

$$
\Sigma_y(h|x) := \operatorname{MSE}[y_t(h|x)] = \sum_{i=0}^{h-1} J\mathbf{A}^iJ'\Sigma_uJ(\mathbf{A}^i)'J'. \tag{10.5.9}
$$

Although this MSE matrix formally looks like the MSE matrix of the optimal
forecast from a VAR model, where $J\mathbf{A}^iJ'$ is replaced by $\Phi_i$, the MSE matrix
in (10.5.9) is in general different from the one of an unconditional forecast.
This fact is easy to see by considering the different definition of the matrix $\mathbf{A}$
used in the pure VAR($p$) case.
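
The conditional forecast (10.5.7) and its MSE matrix (10.5.9) can be computed with a single loop over powers of the companion matrix. The sketch below assumes NumPy and takes the VARX(1, 0) matrices as given; the function name `conditional_forecast` and the numerical values are illustrative only. The example uses $p = 1$ and $s = 0$, in which case the companion matrices reduce to $A_1$ and $B_0$ and $J = I_K$.

```python
import numpy as np

def conditional_forecast(Y_t, x_future, A_comp, B_comp, J, Sigma_u):
    """Conditional h-step forecast and MSE matrix, following (10.5.7) and (10.5.9).

    Y_t:      stacked state vector at the forecast origin
    x_future: list [x_{t+1}, ..., x_{t+h}] of assumed future exogenous values
    A_comp, B_comp, J: matrices of the VARX(1, 0) representation (10.5.4)
    """
    h = len(x_future)
    K = J.shape[0]
    A_pow = np.eye(A_comp.shape[0])                # holds A^i, starting at i = 0
    forecast = np.zeros(K)
    mse = np.zeros((K, K))
    for i in range(h):
        forecast += J @ A_pow @ B_comp @ x_future[h - 1 - i]   # J A^i B x_{t+h-i}
        Phi = J @ A_pow @ J.T
        mse += Phi @ Sigma_u @ Phi.T               # J A^i J' Sigma_u J (A^i)' J'
        A_pow = A_comp @ A_pow
    forecast += J @ A_pow @ Y_t                    # J A^h Y_t
    return forecast, mse

# Example with p = 1, s = 0: companion matrices are A_1 and B_0, J = I_K
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
B0 = np.array([[1.0], [-0.5]])
Sigma_u = np.array([[1.0, 0.2], [0.2, 0.5]])
y_t = np.array([1.0, 2.0])
x_path = [np.array([0.5]), np.array([-1.0])]       # assumed x_{t+1}, x_{t+2}
fc2, mse2 = conditional_forecast(y_t, x_path, A1, B0, np.eye(2), Sigma_u)
```

For $h = 2$ this reproduces $A_1^2 y_t + A_1 B_0 x_{t+1} + B_0 x_{t+2}$ with MSE matrix $\Sigma_u + A_1\Sigma_u A_1'$, which can be verified by hand for the simple model.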

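To illustrate the difference between conditional and unconditional forecasts,
we consider the simple reduced form

$$
y_t = A_1 y_{t-1} + B_0 x_t + u_t, \tag{10.5.10}
$$

where $x_t$ is assumed to be generated by a zero mean VAR(1) process,

$$
x_t = C_1 x_{t-1} + v_t.
$$

Moreover, we assume that $u_t$ and $v_t$ are independent white noise processes
with covariance matrices $\Sigma_u$ and $\Sigma_v$, respectively. The unconditional forecasts
are obtained from the VAR process

$$
\begin{bmatrix} I_K & -B_0 \\ 0 & I_M \end{bmatrix}
\begin{bmatrix} y_t \\ x_t \end{bmatrix}
=
\begin{bmatrix} A_1 & 0 \\ 0 & C_1 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+
\begin{bmatrix} u_t \\ v_t \end{bmatrix},
$$

which has the standard VAR(1) form

$$
\begin{bmatrix} y_t \\ x_t \end{bmatrix}
=
\begin{bmatrix} A_1 & B_0C_1 \\ 0 & C_1 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ x_{t-1} \end{bmatrix}
+
\begin{bmatrix} u_t + B_0v_t \\ v_t \end{bmatrix}.
$$

The optimal 1-step forecast from this model is

$$
\begin{bmatrix} y_t(1) \\ x_t(1) \end{bmatrix}
=
\begin{bmatrix} A_1 & B_0C_1 \\ 0 & C_1 \end{bmatrix}
\begin{bmatrix} y_t \\ x_t \end{bmatrix}.
$$

The corresponding MSE matrix is

$$
E\left\{
\begin{bmatrix} u_t + B_0v_t \\ v_t \end{bmatrix}
\left[(u_t + B_0v_t)', v_t'\right]
\right\}
=
\begin{bmatrix}
\Sigma_u + B_0\Sigma_vB_0' & B_0\Sigma_v \\
\Sigma_vB_0' & \Sigma_v
\end{bmatrix}.
$$

The upper left-hand corner block of this matrix is the MSE matrix of $y_t(1)$,
the unconditional forecast of the endogenous variables. Thus,

$$
\Sigma_y(1) = \Sigma_u + B_0\Sigma_vB_0'. \tag{10.5.11}
$$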

On the other hand, in the VARX(1, 0) representation (10.5.4), we have $\mathbf{A} = A_1$
and $\mathbf{B} = B_0$ for the present example. Hence, the conditional 1-step forecast
of $y_t$ is

$$
y_t(1|x) = A_1y_t + B_0x_{t+1}
$$

with corresponding MSE matrix

$$
\Sigma_y(1|x) = \Sigma_u.
$$

Obviously, $\Sigma_y(1) - \Sigma_y(1|x) = B_0\Sigma_vB_0'$ is positive semidefinite and, thus, the
unconditional forecast is inferior to the conditional forecast, if $B_0 \neq 0$. It
must  be  kept  in  mind,  however,  that  the  conditional  forecast  is  only  feasible 
if  the  future  values  of  the  exogenous  variables  are  either  known  or  assumed.  If 
only  hypothetical  values  are  used,  the  conditional  forecast  may  be  quite  poor 
if  the  actual  values  of  the  exogenous  variables  turn  out  to  be  different  from 
the  hypothetical  ones.  The  smaller  MSE  of  the  conditional  forecast  is  simply 
due  to  ignoring  any  uncertainty  regarding  the  future  paths  of  the  exogenous 
variables. 
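
The comparison of the two 1-step MSE matrices is easy to reproduce numerically. The following sketch assumes NumPy and uses made-up parameter values for the simple model (10.5.10); it builds the joint forecast error covariance and checks that the gap between unconditional and conditional MSE is positive semidefinite.

```python
import numpy as np

# Illustrative (assumed) parameter values for the model (10.5.10)
B0 = np.array([[1.0], [-0.5]])                     # (K x M)
Sigma_u = np.array([[1.0, 0.2], [0.2, 0.5]])       # covariance of u_t
Sigma_v = np.array([[2.0]])                        # covariance of v_t
K, M = B0.shape

# Inverse of [[I_K, -B0], [0, I_M]], which maps (u_t, v_t) to the reduced form errors
P_inv = np.block([[np.eye(K), B0], [np.zeros((M, K)), np.eye(M)]])
Sigma_uv = np.block([[Sigma_u, np.zeros((K, M))], [np.zeros((M, K)), Sigma_v]])
Sigma_joint = P_inv @ Sigma_uv @ P_inv.T           # MSE of the joint 1-step forecast

mse_unconditional = Sigma_joint[:K, :K]            # Sigma_y(1)
mse_conditional = Sigma_u                          # Sigma_y(1|x)
gap = mse_unconditional - mse_conditional          # should equal B0 Sigma_v B0'
gap_eigenvalues = np.linalg.eigvalsh(gap)          # nonnegative up to rounding
```

The gap has rank at most $M$, which reflects the fact that only uncertainty about the $M$ exogenous variables is being ignored by the conditional forecast.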

Using  the  foregoing  results,  interval  forecasts  and  forecast  regions  can  be 
set  up  as  usual.  It  may  also  be  worth  pointing  out  that  we  have  not  used 
the  stability  of  the  VAR  operator  or  stationarity  of  the  variables.  Hence,  the 
formulas  are  also  valid  for  systems  with  integrated  and  cointegrated  variables. 
So  far  we  have  discussed  forecasting  with  known  models.  The  case  of  estimated 
models  will  be  considered  next. 


10.5.2  Forecasting  Estimated  Dynamic  SEMs 

In order to evaluate the consequences of using estimated instead of known
processes for unconditional forecasts, we can use a joint model for the
endogenous and exogenous variables and then draw on results of the previous
chapters. Therefore, in this section we will focus on conditional forecasts only.
We denote by $\hat{y}_t(h|x)$ the conditional $h$-step forecast (10.5.7) based on the
estimated reduced form (10.2.3). The forecast error is

$$
y_{t+h} - \hat{y}_t(h|x) = [y_{t+h} - y_t(h|x)] + [y_t(h|x) - \hat{y}_t(h|x)]. \tag{10.5.12}
$$

Conditional on the exogenous variables, the two terms in brackets are
uncorrelated. Hence, assuming, as in previous chapters, that the processes used for
estimation and forecasting are independent, an MSE approximation

$$
\Sigma_{\hat{y}}(h|x) = \Sigma_y(h|x) + \frac{1}{T}\Omega_y(h|x) \tag{10.5.13}
$$

is obtained in the by now familiar way. Here

$$
\Omega_y(h|x) := E\left[\frac{\partial y_t(h|x)}{\partial\beta'}\,\Sigma_{\hat{\beta}}\,\frac{\partial y_t(h|x)'}{\partial\beta}\right], \tag{10.5.14}
$$

$\beta := \operatorname{vec}[A_1, \dots, A_p, B_1, \dots, B_s, B_0]$ and $\Sigma_{\hat{\beta}}$ is the covariance matrix of the
asymptotic distribution of $\sqrt{T}(\hat{\beta} - \beta)$. It is straightforward to show that

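$$
\begin{aligned}
\frac{\partial y_t(h|x)}{\partial\beta'}
&= \frac{\partial(J\mathbf{A}^h\mathbf{Y}_t)}{\partial\beta'} + \sum_{i=0}^{h-1}\frac{\partial(J\mathbf{A}^i\mathbf{B}x_{t+h-i})}{\partial\beta'} \\
&= \sum_{i=0}^{h-1}\left[\mathbf{Y}_t'(\mathbf{A}')^{h-1-i} \otimes J\mathbf{A}^iJ' \;:\; 0\right] \\
&\quad + \sum_{i=0}^{h-1}\left[\sum_{j=0}^{i-1} x_{t+h-i}'\mathbf{B}'(\mathbf{A}')^{i-1-j} \otimes J\mathbf{A}^jJ' \;:\; x_{t+h-i}' \otimes J\mathbf{A}^iJ'\right].
\end{aligned} \tag{10.5.15}
$$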


For stationary processes, an estimator of $\Omega_y(h|x)$ is obtained in the usual
way by replacing all unknown parameters in this expression and in $\Sigma_{\hat{\beta}}$ by
estimators and by using the average over $t = 1, \dots, T$ for the expectation in
(10.5.14).

Although we have discussed forecasting with estimated coefficients in
terms of a simple VARX($p$, $s$) model with white noise residuals, it is possible
to generalize these results to models with autocorrelated error processes.
The more general case was treated, for instance, by Yamamoto (1980) and
Baillie (1981).


10.6  Multiplier  Analysis 

In an econometric simultaneous equations analysis, the marginal impact of
changes in the exogenous variables is sometimes investigated. For example, if
the exogenous variables are instruments for, say, the government or a central
bank, the consequences of changes in these instruments may be of interest.
A government may, for instance, desire to know the effects of a change in
a tax rate. In that case, policy simulation is of interest. In other cases, the
consequences of changes in the exogenous variables that are not under the
control of any decision maker may be of interest. For instance, it may be
desirable to study the future consequences of the present weather conditions.
Therefore, the dynamic multipliers discussed in Section 10.2.2 are considered.
They are contained in the $D_i$ matrices of the final form operator,

$$
D(L) = \sum_{i=0}^{\infty} D_iL^i := A(L)^{-1}B(L),
$$

where $A(L) := I_K - A_1L - \cdots - A_pL^p$ and $B(L) := B_0 + B_1L + \cdots + B_sL^s$ are
the reduced form operators, as before. Here stability and, hence, invertibility
of the VAR operator $A(L)$ is assumed. The $D_i$ matrices are conveniently
obtained from the VARX(1, 0) representation (10.5.4) which implies

$$
y_t = \sum_{i=0}^{\infty} J\mathbf{A}^i\mathbf{B}x_{t-i} + \sum_{i=0}^{\infty} J\mathbf{A}^iJ'u_{t-i}, \tag{10.6.1}
$$

because $J\mathbf{A}^h\mathbf{Y}_t \to 0$ as $h \to \infty$, if $y_t$ is a stable, stationary process (see
(10.5.6)). The $D_i$'s are coefficient matrices of the exogenous variables in the
final form representation. Thus,

$$
D_i = J\mathbf{A}^i\mathbf{B}, \quad i = 0, 1, \dots, \tag{10.6.2}
$$

the $n$-th interim multipliers are

$$
M_n := D_0 + D_1 + \cdots + D_n = J(I + \mathbf{A} + \cdots + \mathbf{A}^n)\mathbf{B}, \quad n = 0, 1, \dots, \tag{10.6.3}
$$

and the total multipliers are

$$
M_\infty := \sum_{i=0}^{\infty} D_i = J(I - \mathbf{A})^{-1}\mathbf{B} = A(1)^{-1}B(1). \tag{10.6.4}
$$

If the model contains integrated variables and the generation mechanism
is started at time $t = 0$, say, from a set of initial values, then we get from
(10.5.5),

$$
y_t = J\mathbf{A}^t\mathbf{Y}_0 + \sum_{i=0}^{t-1} J\mathbf{A}^i\mathbf{B}x_{t-i} + \sum_{i=0}^{t-1} J\mathbf{A}^iJ'u_{t-i}. \tag{10.6.5}
$$

Thus, the $D_i$ matrices in (10.6.2) still reflect the marginal impacts of changes
in the unmodelled variables and, hence, contain the multipliers. Also the $n$-th
interim multipliers can be computed as in (10.6.3), whereas the total multipliers
in (10.6.4) will not exist in general.
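
The multiplier formulas (10.6.2) to (10.6.4) translate directly into code once the VARX(1, 0) matrices are built. The sketch below assumes NumPy and a stable VAR part; the helper names `varx_companion` and `multipliers` and the coefficient values are illustrative assumptions, not part of the text.

```python
import numpy as np

def varx_companion(A, B):
    """Build the VARX(1, 0) matrices of (10.5.4) from A = [A_1..A_p], B = [B_0..B_s]."""
    K, M = B[0].shape
    p, s = len(A), len(B) - 1
    n = K * p + M * s
    A_comp = np.zeros((n, n))
    A_comp[:K, :K * p] = np.hstack(A)                        # A_1 ... A_p
    if s > 0:
        A_comp[:K, K * p:] = np.hstack(B[1:])                # B_1 ... B_s
    A_comp[K:K * p, :K * (p - 1)] = np.eye(K * (p - 1))      # shift lagged y
    if s > 1:
        A_comp[K * p + M:, K * p:K * p + M * (s - 1)] = np.eye(M * (s - 1))  # shift lagged x
    B_comp = np.zeros((n, M))
    B_comp[:K] = B[0]                                        # B_0 in the y_t rows
    if s > 0:
        B_comp[K * p:K * p + M] = np.eye(M)                  # x_t enters its own row
    J = np.hstack([np.eye(K), np.zeros((K, n - K))])
    return A_comp, B_comp, J

def multipliers(A, B, n):
    """Dynamic multipliers D_0, ..., D_n with D_i = J A^i B."""
    A_comp, B_comp, J = varx_companion(A, B)
    D, A_pow = [], np.eye(A_comp.shape[0])
    for _ in range(n + 1):
        D.append(J @ A_pow @ B_comp)
        A_pow = A_comp @ A_pow
    return D

A = [np.array([[0.5, 0.1], [0.0, 0.3]]), np.array([[0.2, 0.0], [0.1, 0.1]])]
B = [np.array([[1.0], [0.5]]), np.array([[0.3], [0.0]]), np.array([[0.1], [0.2]])]
D = multipliers(A, B, 2)
M_2 = sum(D)                                                 # interim multipliers M_2
A_comp, B_comp, J = varx_companion(A, B)
M_total = J @ np.linalg.solve(np.eye(A_comp.shape[0]) - A_comp, B_comp)
```

A useful cross-check is the recursion $D_i = \sum_j A_jD_{i-j} + B_i$ implied by the reduced form, together with the identity $M_\infty = A(1)^{-1}B(1)$ from (10.6.4).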

Having obtained the foregoing representations of the multipliers, estimation
of these quantities is straightforward. Estimators of the dynamic multipliers
are obtained by substituting estimators $\hat{A}_i$ and $\hat{B}_j$ of the coefficient
matrices in $\mathbf{A}$ and $\mathbf{B}$. The asymptotic properties of the estimators then follow
in the usual way. For completeness we mention the following result from
Schmidt (1973).

In the framework of Section 10.3, suppose $\hat{\beta}$ is a consistent estimator of
$\beta := \operatorname{vec}[A, B, B_0]$ satisfying

$$
\sqrt{T}(\hat{\beta} - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat{\beta}}).
$$

Then

$$
\sqrt{T}\operatorname{vec}(\hat{D}_i - D_i) \xrightarrow{d} \mathcal{N}(0, G_i\Sigma_{\hat{\beta}}G_i'), \tag{10.6.6}
$$

where $G_0 := [0 : I_{KM}]$ and

$$
G_i := \frac{\partial\operatorname{vec}(D_i)}{\partial\beta'}
= \left[\sum_{j=0}^{i-1}\mathbf{B}'(\mathbf{A}')^{i-1-j} \otimes J\mathbf{A}^jJ' \;:\; I_M \otimes J\mathbf{A}^iJ'\right],
\quad i = 1, 2, \dots,
$$

are $(KM \times (K^2p + KM(s + 1)))$ matrices. The proof of this result is left as
an exercise. It is also easy to find the asymptotic distribution of the interim
multipliers (accumulated multipliers) and the total multipliers if they exist
(see Problem 10.8).


10.7  Optimal  Control 

A policy or decision maker who has control over some of the exogenous variables
can use a dynamic simultaneous equations model to assess interventions
with a multiplier or simulation analysis, as described in the previous section.
However, if the decision maker has specific target values of the endogenous
variables in mind, he or she may wish to go a step further and determine
which values of the instrument variables will produce the desired values of the
endogenous variables.

Usually it will not be possible to actually achieve all targets simultaneously
and sometimes the decision maker is not completely free to choose the
instruments. For instance, doubling a particular tax rate or increasing the
price of specific government services drastically may result in the overthrow
of the government or in social unrest and is therefore not a feasible option.
Therefore, a loss function is usually set up in which the loss of deviations
from the target values is specified. For instance, if the desired paths of the
endogenous and instrument variables after period $T$ are $y^0_{T+1}, \dots, y^0_{T+n}$ and
$x^0_{T+1}, \dots, x^0_{T+n}$, respectively, a quadratic loss function has the form

$$
\mathcal{L} = \sum_{i=1}^{n}\left[(y_{T+i} - y^0_{T+i})'K_i(y_{T+i} - y^0_{T+i})
 + (x_{T+i} - x^0_{T+i})'P_i(x_{T+i} - x^0_{T+i})\right], \tag{10.7.1}
$$

where the $K_i$ and $P_i$ are symmetric positive semidefinite matrices. Because
the variables are assumed to be stochastic, the loss is a random variable too.
Therefore, minimization of the average or expected loss, $E(\mathcal{L})$, is usually the
objective.

In  a  quadratic  loss  function  the  same  weight  is  assigned  to  positive  and 
negative  deviations  from  the  target  values.  For  many  situations  and  variables 
this  specification  is  not  quite  realistic.  For  example,  if  the  target  is  to  have  an 
unemployment  rate  of  2%,  then  having  less  than  2%  may  not  be  a  problem  at 
all  while  any  higher  rate  may  be  regarded  as  a  serious  problem.  Nevertheless, 
quadratic  loss  functions  are  the  most  common  ones  in  applied  and  theoretical 
studies.  Therefore,  we  will  also  use  them  in  the  following.  One  reason  for  the 
popularity  of  this  type  of  loss  function  is  clearly  its  tractability. 

In order to approach a formal solution of the optimal control problem
outlined in the foregoing, we assume that the economic system is described
by a model like (10.1.1) with reduced form (10.2.3). However, to be able to
distinguish between instrument variables and other exogenous variables, we
introduce a new symbol for the latter. Suppose $x_t$ represents an
$(M \times 1)$ vector of instrument variables, the $(N \times 1)$ vector
$z_t$ contains all other unmodelled variables and the reduced form of the
model is

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + B_0 x_t + \cdots + B_s x_{t-s} + C z_t + u_t, \tag{10.7.2}$$

where $u_t$ is white noise. Some of the components of $z_t$ may be lagged
variables. To summarize them in a vector indexed by $t$ is just a matter of
convenience.

For the present purposes, it is useful to write the model in VARX(1,0)
form similar to (10.5.4),

$$Y_t = \mathbf{A} Y_{t-1} + \mathbf{B} x_t + \mathbf{C} z_t + U_t, \tag{10.7.3}$$

where $Y_t$, $U_t$, $\mathbf{A}$, and $\mathbf{B}$ are as defined in (10.5.4)
and

$$\mathbf{C} := \begin{bmatrix} C \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

is a $((Kp + Ms) \times N)$ matrix. Recall that

$$Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ x_t \\ \vdots \\ x_{t-s+1} \end{bmatrix}$$

contains current and lagged endogenous and instrument variables. Thus, the
quadratic loss function specified in (10.7.1) may be rewritten in the form

$$\mathcal{L} = \sum_{i=1}^{n} (Y_{T+i} - Y^0_{T+i})' Q_i (Y_{T+i} - Y^0_{T+i}), \tag{10.7.4}$$

where the $Q_i$ are symmetric positive semidefinite matrices involving the
$K_i$'s and $P_i$'s.

In this framework, the problem of optimal control may be stated as follows:
Given the model (10.7.3), given the vector $Y_T$, given values
$z_{T+1}, \dots, z_{T+n}$ of the uncontrolled variables and given target
values $y^0_{T+1}, \dots, y^0_{T+n}$ and $x^0_{T+1}, \dots, x^0_{T+n}$, find
the values $x^*_{T+1}, \dots, x^*_{T+n}$ that minimize the expected loss
$E(\mathcal{L})$ specified in (10.7.4). The solution to this dynamic
programming problem is well documented in the control theory literature. It
turns out to be

$$x^*_{T+i} = G_i Y_{T+i-1} + g_i, \quad i = 1, \dots, n, \tag{10.7.5}$$

where the $Y_{T+i}$ are assumed to be obtained as

$$Y_{T+i} = \mathbf{A} Y_{T+i-1} + \mathbf{B} x^*_{T+i} + \mathbf{C} z_{T+i} + U_{T+i}.$$

Here the $(M \times (Kp + Ms))$ matrix $G_i$ is defined as

$$G_i := -(\mathbf{B}' H_i \mathbf{B})^{-1} \mathbf{B}' H_i \mathbf{A}$$

and the $(M \times 1)$ vector $g_i$ is defined as

$$g_i := -(\mathbf{B}' H_i \mathbf{B})^{-1} \mathbf{B}' (H_i \mathbf{C} z_{T+i} - h_i)$$

with

$$H_n := Q_n \quad\text{and}\quad H_{i-1} := Q_{i-1} + (\mathbf{A} + \mathbf{B} G_i)' H_i (\mathbf{A} + \mathbf{B} G_i) \quad\text{for } i = 2, \dots, n,$$

and

$$h_n := Q_n Y^0_{T+n} \quad\text{and}\quad h_{i-1} := Q_{i-1} Y^0_{T+i-1} - \mathbf{A}' H_i (\mathbf{C} z_{T+i} + \mathbf{B} g_i) + \mathbf{A}' h_i \quad\text{for } i = 2, \dots, n.$$

The actual computation of these quantities proceeds in the order $H_n$,
$G_n$, $h_n$, $g_n$, $H_{n-1}$, $G_{n-1}$, $h_{n-1}$, $g_{n-1}$, $H_{n-2}$,
$\dots$. This solution can be found in various variations in the control
theory literature (e.g., Chow (1975, 1981), Murata (1982)). Obviously,
because the $Y_t$ are random, the same is true for the optimal decision rule
$x^*_{T+i}$, $i = 1, \dots, n$.
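The backward recursion can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions: $K = M = 1$ and $p = s = 1$, so that $Y_t = (y_t, x_t)'$ and $\mathbf{B}'H_i\mathbf{B}$ is a scalar; the companion matrices are assumed to have the form $\mathbf{A} = [A_1, B_1; 0, 0]$ and $\mathbf{B} = (B_0, 1)'$; all parameter values and the helper names (`control_rule`, `closed_loop_path`, `total_loss`) are invented for this example. The recursion for $H_{i-1}$ and $h_{i-1}$ is run for $i = n, \dots, 2$.

```python
# Backward recursion behind (10.7.5) for a toy model with K = M = 1, p = s = 1.
# All parameter values are hypothetical.

def mat_vec(A, v):
    """Matrix times column vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vec_mat(v, A):
    """Row vector times matrix, i.e. the entries of A'v."""
    return [sum(v[k] * A[k][j] for k in range(len(v))) for j in range(len(A[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def control_rule(A, B, C, Q, Y0, z):
    """Return G_i (row vectors) and g_i (scalars), i = 1..n; index 0 is unused."""
    n = len(Q) - 1
    G, g = [None] * (n + 1), [None] * (n + 1)
    H, h = Q[n], mat_vec(Q[n], Y0[n])              # H_n and h_n
    for i in range(n, 0, -1):
        w = mat_vec(H, B)                          # H_i B (H_i is symmetric)
        bhb = dot(B, w)                            # scalar B'H_i B because M = 1
        G[i] = [-x / bhb for x in vec_mat(w, A)]   # -(B'H_i B)^{-1} B'H_i A
        Cz = [c * z[i] for c in C]
        g[i] = -(dot(w, Cz) - dot(B, h)) / bhb     # -(B'H_i B)^{-1} B'(H_i C z - h_i)
        if i > 1:
            F = [[A[r][c] + B[r] * G[i][c] for c in range(2)] for r in range(2)]
            HF = [[sum(H[r][k] * F[k][c] for k in range(2)) for c in range(2)]
                  for r in range(2)]
            H_prev = [[Q[i - 1][r][c] + sum(F[k][r] * HF[k][c] for k in range(2))
                       for c in range(2)] for r in range(2)]
            v = mat_vec(H, [Cz[k] + B[k] * g[i] for k in range(2)])
            q0 = mat_vec(Q[i - 1], Y0[i - 1])
            AtHv, Ath = vec_mat(v, A), vec_mat(h, A)   # A'H_i(Cz + Bg_i) and A'h_i
            h = [q0[k] - AtHv[k] + Ath[k] for k in range(2)]
            H = H_prev
    return G, g

def closed_loop_path(A, B, C, G, g, z, Y_T):
    """Instruments x*_{T+i} = G_i Y_{T+i-1} + g_i with the noise set to zero."""
    Y, xs = Y_T, [None]
    for i in range(1, len(G)):
        x = dot(G[i], Y) + g[i]
        xs.append(x)
        Y = [dot(A[r], Y) + B[r] * x + C[r] * z[i] for r in range(2)]
    return xs

def total_loss(A, B, C, Q, Y0, z, Y_T, xs):
    """Loss (10.7.4) along the noise-free path generated by xs[1..n]."""
    Y, loss = Y_T, 0.0
    for i in range(1, len(Q)):
        Y = [dot(A[r], Y) + B[r] * xs[i] + C[r] * z[i] for r in range(2)]
        d = [Y[k] - Y0[i][k] for k in range(2)]
        loss += dot(d, mat_vec(Q[i], d))
    return loss

# y_t = 0.8 y_{t-1} + 0.5 x_t + 0.2 x_{t-1} + 0.1 z_t + u_t, written for Y_t = (y_t, x_t)'.
A = [[0.8, 0.2], [0.0, 0.0]]
B = [0.5, 1.0]
C = [0.1, 0.0]
n = 3
Q = [None] + [[[1.0, 0.0], [0.0, 0.1]] for _ in range(n)]
Y0 = [None] + [[2.0, 0.0] for _ in range(n)]
z = [None] + [1.0] * n
Y_T = [1.0, 0.5]

G, g = control_rule(A, B, C, Q, Y0, z)
x_star = closed_loop_path(A, B, C, G, g, z, Y_T)
print(x_star[1:], total_loss(A, B, C, Q, Y0, z, Y_T, x_star))
```

With the noise switched off, the closed-loop path coincides with the minimizer of the deterministic loss, so perturbing any single instrument value should not lower `total_loss`; this gives a simple sanity check of the recursion.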

There are a number of problems that arise in practice in the context of
optimal control as presented here. For instance, we have considered a finite
planning horizon of $n$ periods. In some situations it is of interest to
find the optimal decision rule for an infinite planning period. Moreover, in
practice the parameter matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$
are usually unknown and have to be replaced by estimators. More generally,
stochastic parameter models may be considered. This, of course, introduces
an additional stochastic element into the optimal decision rule. A further
complication arises if the relations between the variables cannot be
captured adequately by a linear model such as (10.7.2) but require a
nonlinear specification. It is also possible to consider other types of
optimization rules. In this section, we have assumed that the optimal
decision rule for period $T + i$ is determined on the basis of all available
information in period $T + i - 1$. In particular, the realization
$Y_{T+i-1}$ is assumed to be given in setting up the decision rule
$x^*_{T+i}$. Such an approach is often referred to as a closed-loop
strategy. An alternative approach would be to determine the decision rule at
the beginning of the planning period for the entire planning horizon. This
approach is called an open-loop strategy. Although it is in general inferior
to closed-loop optimization, it may be of interest occasionally. These and
many other topics are treated in the optimal control literature. Chow (1975,
1981) and Murata (1982) are books on the topic with emphasis on optimal
decision making related to economic and econometric models. Friedmann (1981)
provided the asymptotic properties of the optimal decision rule when
estimators are substituted for the parameters in the control rule.

In this chapter, we have summarized some problems related to the estimation,
specification, and analysis of dynamic models with unmodelled variables.
Major problem areas that were identified without giving details of possible
solutions are the distinction between endogenous and exogenous variables,
the identification or unique parameterization of dynamic models, the
estimation, specification, and checking of structural form models as well as
the treatment of nonlinear specifications. Also, we have just scratched the
surface of control problems which represent one important area of
applications of dynamic SEMs.

Other problems of obvious importance in the context of these models relate
to the choice of the data associated with the variables. If a structural
form is derived from some economic or other subject matter theory, it is
important that the available data represent realizations of the variables
related to the theory. In particular, the level of aggregation (temporal and
contemporaneous) and seasonal characteristics (seasonally adjusted or
unadjusted) may be of importance. The models we have considered do not allow
specifically for seasonality, except perhaps for seasonal dummies and other
seasonal components among the unmodelled variables. The seasonality aspect
in the context of dynamic SEMs and models specifically designed for seasonal
data were discussed, for example, by Hylleberg (1986).

So far, we have essentially considered stationary and integrated processes.
Mild deviations from the stationarity assumption are possible in dynamic
SEMs where unmodelled variables may cause changes in the mean or conditional
mean of the endogenous variables. However, in discussing properties of
estimators or long-run multipliers, we have made assumptions that come close
to assuming stationarity or cointegration. For instance, if the unmodelled
variables are driven by a stationary VAR process, the means and second
moments of the endogenous variables may be time invariant. Unfortunately, in
practice, changes in the data generation process may occur. Therefore, we
will discuss specific types of models with time varying parameters in later
chapters (see Chapters 17 and 18).


10.9  Exercises 

Problem  10.1 

Consider the following structural form

$$Q_t = \alpha_0 + \alpha_1 R_{t-1} + w_{1t},$$

$$P_t = \beta_0 + \beta_1 Q_t + w_{2t},$$

where  Rt  is  a  measure  for  the  rainfall  in  period  t,  Qt  is  the  quantity  of  an 
agricultural  product  supplied  in  period  t,  and  Pt  is  the  price  of  the  product. 
Derive  the  reduced  form,  the  final  equations,  and  the  final  form  of  the  model. 

Problem  10.2 

Suppose that the rainfall variable $R_t$ in Problem 10.1 is generated by a
white noise process with mean $\mu_R$. Determine the unconditional 3-step
ahead forecasts for $Q_t$ and $P_t$ based on the model from Problem 10.1.
Determine also the conditional 3-step ahead forecasts given
$R_{t+i} = \mu_R$, $i = 1, 2, 3$. Compare the two forecasts.

Problem  10.3 

Given the model of Problem 10.1, what is the marginal total or long-run
effect of an additional unit of rainfall in period $t$?

Problem 10.4

Suppose the system $y_t$ has the structural form

$$A^*(L) y_t = F^* y^e_t + B^*(L) x_t + w_t,$$

where $A^*(L) := A - A^*_1 L - \cdots - A^*_p L^p$,
$B^*(L) := B^*_0 + B^*_1 L + \cdots + B^*_s L^s$ and $x_t$ is generated by a
VAR($q$) process

$$C(L) x_t = v_t.$$

Assume that $y^e_t$ represents rational expectations formed in period
$t - 1$ and eliminate the expectations variables from the structural form.

Problem 10.5

Show that the 1-step ahead forecast for $y_t$ obtained from the VAR($p$)
model (10.5.3) is identical to the one determined from (10.5.2) if

$$x_t(1) = C_1 x_t + \cdots + C_q x_{t-q+1}$$

is used as forecast for the exogenous variables.

Problem  10.6 

Show that the partial derivatives $\partial y_t(h|x) / \partial\beta'$ have
the form given in (10.5.15).

Problem 10.7

Derive  a  prediction  test  for  structural  change  on  the  basis  of  the  conditional 
forecasts  of  the  endogenous  variables  of  a  dynamic  SEM. 

Problem  10.8 

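Show that the dynamic multipliers have the asymptotic distributions given in
Section 10.6. Show also that the $n$-th interim multipliers have an
asymptotic normal distribution,

$$\sqrt{T}\,\mathrm{vec}(\hat M_n - M_n) \xrightarrow{d} N(0, \Sigma_m(n)),$$

where

$$\Sigma_m(n) = (G_0 + \cdots + G_n) \Sigma_{\hat\beta} (G_0 + \cdots + G_n)'$$

and the $G_i$ are the $[KM \times K(Kp + M(s+1))]$ matrices defined in
Section 10.6. Furthermore,

$$\sqrt{T}\,\mathrm{vec}(\hat M_\infty - M_\infty) \xrightarrow{d} N(0, \Sigma_m(\infty)),$$

where

$$\Sigma_m(\infty) = G_\infty \Sigma_{\hat\beta} G_\infty'$$

with

$$G_\infty := \left[((I - \mathbf{A})^{-1}\mathbf{B})' : I_M\right] \otimes J(I - \mathbf{A})^{-1}J'.$$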

Here  the  notation  from  Section  10.6  is  used. 

Problem  10.9 

Derive the optimal decision rule for the control problem stated in Section
10.7. (Hint: See Chow (1975).)

So far we have considered finite order VAR processes. A more flexible and
perhaps more realistic class of processes is obtained by allowing for an
infinite VAR order. Of course, having only a finite string of time series
data, the infinitely many VAR coefficients cannot be estimated without
further assumptions. There are two competing approaches that have been used
in practice in order to overcome this problem. In one approach, it is
assumed that the infinite number of VAR coefficients depend on finitely many
parameters. In Chapter 11, vector autoregressive moving average (VARMA)
processes are introduced that may be viewed as finite parameterizations of
potentially infinite order VAR processes. Estimation and specification of
these processes are discussed in Chapters 12 and 13, respectively.
Cointegrated VARMA processes are considered in Chapter 14. In Chapter 15,
another approach is pursued. In that approach, the infinite order VAR
operator is truncated at some finite lag and the resulting finite order VAR
model is estimated. It is assumed, however, that the truncation point
depends on the time series length available for estimation. A suitable
asymptotic theory for the resulting estimators is discussed both for
stationary as well as cointegrated processes.


11  Vector  Autoregressive  Moving  Average  Processes


11.1  Introduction 

In this chapter, we extend our standard finite order VAR model,

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + \varepsilon_t,$$

by allowing the error terms, here $\varepsilon_t$, to be autocorrelated
rather than white noise. The autocorrelation structure is assumed to be of a
relatively simple type so that $\varepsilon_t$ has a finite order moving
average (MA) representation,

$$\varepsilon_t = u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q},$$

where, as usual, $u_t$ is zero mean white noise with nonsingular covariance
matrix $\Sigma_u$. A finite order VAR process with finite order MA error term
is called a VARMA (vector autoregressive moving average) process.

Before we study VARMA processes in general, we will discuss some properties
of finite order MA processes in Section 11.2. In Section 11.3, we consider
the more general stationary VARMA processes with stable VAR part and we will
learn that generally they have infinite order pure VAR and MA
representations. Their autocovariance and autocorrelation properties are
treated in Section 11.4 and forecasting VARMA processes is discussed in
Section 11.5. In Section 11.6, transforming and aggregating these processes
is considered. In that section, we will see that a linearly transformed
finite order VAR($p$) process, in general, does not admit a finite order VAR
representation but becomes a VARMA process. Because transformations of
variables are quite common in practice, this result is a powerful argument
in favor of the more general VARMA class. Finally, Section 11.7 contains
discussions of causality issues and impulse response analysis in the context
of VARMA systems. Throughout this chapter, we consider stationary processes
only.

11.2  Finite  Order  Moving  Average  Processes

In Chapter 2, we have encountered MA processes of possibly infinite order.
Specifically, we have seen that stationary, stable finite order VAR
processes can be represented as MA processes. Now we deal explicitly with
finite order MA processes. Let us begin with the simplest case of a
$K$-dimensional MA process of order 1 (MA(1) process),
$y_t = \mu + u_t + M_1 u_{t-1}$, where $y_t = (y_{1t}, \dots, y_{Kt})'$,
$u_t$ is zero mean white noise with nonsingular covariance matrix
$\Sigma_u$, and $\mu = (\mu_1, \dots, \mu_K)'$ is the mean vector of $y_t$,
i.e., $E(y_t) = \mu$ for all $t$. For notational simplicity we will assume in
the following that $\mu = 0$, that is, $y_t$ is a zero mean process. Thus, we
consider

$$y_t = u_t + M_1 u_{t-1}, \quad t = 0, \pm 1, \pm 2, \dots, \tag{11.2.1}$$

which may be rewritten as

$$u_t = y_t - M_1 u_{t-1}.$$

By successive substitution we get

$$\begin{aligned}
u_t &= y_t - M_1(y_{t-1} - M_1 u_{t-2}) = y_t - M_1 y_{t-1} + M_1^2 u_{t-2} \\
&= \cdots = y_t - M_1 y_{t-1} + \cdots + (-M_1)^n y_{t-n} + (-M_1)^{n+1} u_{t-n-1} \\
&= y_t + \sum_{i=1}^{\infty} (-M_1)^i y_{t-i},
\end{aligned}$$

if $M_1^i \to 0$ as $i \to \infty$. Hence,

$$y_t = -\sum_{i=1}^{\infty} (-M_1)^i y_{t-i} + u_t, \tag{11.2.2}$$

which is the potentially infinite order VAR representation of the process.
Because $(-M_1)^i$ may be equal to zero for $i$ greater than some finite
number $p$, the process may in fact be a finite order VAR($p$). For instance,
we get $p = 1$ for a bivariate process with

$$M_1 = \begin{bmatrix} 0 & m \\ 0 & 0 \end{bmatrix},$$

where $m$ is some nonzero real number.

For the representation (11.2.2) to be meaningful, $M_1^i$ must approach zero
as $i \to \infty$, which in turn requires that the eigenvalues of $M_1$ are
all less than 1 in modulus or, equivalently,

$$\det(I_K + M_1 z) \neq 0 \quad\text{for } z \in \mathbb{C},\ |z| \leq 1.$$

This condition is analogous to the stability condition for a VAR(1) process.
It guarantees that the infinite sum in (11.2.2) exists as a mean square
limit.
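A minimal sketch for the bivariate case may help to see both points: the eigenvalue check and the possibility of a finite order VAR. The function names are hypothetical, and the closed-form eigenvalue formula used here only applies to $(2 \times 2)$ matrices.

```python
import cmath

def eig_moduli_2x2(M):
    """Moduli of the two eigenvalues of a real 2x2 matrix."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return [abs((tr + root) / 2), abs((tr - root) / 2)]

def is_invertible_ma1(M1):
    """True if all eigenvalues of M_1 are less than 1 in modulus."""
    return max(eig_moduli_2x2(M1)) < 1

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def var_coefficients(M1, n):
    """Coefficients -(-M_1)^i of y_{t-i} in (11.2.2), for i = 1..n."""
    neg = [[-x for x in row] for row in M1]
    power, out = neg, []
    for _ in range(n):
        out.append([[-x for x in row] for row in power])
        power = matmul(power, neg)
    return out

nilpotent = [[0.0, 3.0], [0.0, 0.0]]      # upper-triangular example with m = 3
print(is_invertible_ma1(nilpotent))       # both eigenvalues are zero
print(var_coefficients(nilpotent, 3))     # only the first coefficient is nonzero
```

Note that the size of $m$ is irrelevant in the nilpotent example: the eigenvalues of $M_1$ are zero whatever $m$ is, so the condition holds and the VAR representation stops after one lag.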




More generally, it can be shown that a (zero mean) MA($q$) process (moving
average process of order $q$),

$$y_t = u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}, \quad t = 0, \pm 1, \pm 2, \dots, \tag{11.2.3}$$

has a pure VAR representation

$$y_t = \sum_{i=1}^{\infty} \Pi_i y_{t-i} + u_t \tag{11.2.4}$$

if

$$\det(I_K + M_1 z + \cdots + M_q z^q) \neq 0 \quad\text{for } z \in \mathbb{C},\ |z| \leq 1. \tag{11.2.5}$$

An MA($q$) process with this property is called invertible in the following
because we can invert from the MA to a VAR representation. Writing the
process in lag operator notation as

$$y_t = (I_K + M_1 L + \cdots + M_q L^q) u_t = M(L) u_t,$$

the MA operator $M(L) := I_K + M_1 L + \cdots + M_q L^q$ is invertible if it
satisfies (11.2.5) and we may formally write

$$M(L)^{-1} y_t = u_t.$$

The actual computation of the coefficient matrices $\Pi_i$ in

$$M(L)^{-1} = \Pi(L) = I_K - \sum_{i=1}^{\infty} \Pi_i L^i$$

can be done recursively using $\Pi_1 = M_1$ and

$$\Pi_i = M_i - \sum_{j=1}^{i-1} \Pi_{i-j} M_j, \quad i = 2, 3, \dots, \tag{11.2.6}$$

where $M_j := 0$ for $j > q$. These recursions follow immediately from the
corresponding recursions used to compute the MA coefficients of a pure VAR
process (see Chapter 2, (2.1.22)).
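The recursion (11.2.6) is easy to code. The following sketch uses plain nested lists for matrices; the function name `ma_to_var` and the MA(2) coefficient values are made up for illustration.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def zeros(K):
    return [[0.0] * K for _ in range(K)]

def ma_to_var(M, n):
    """Pi_1..Pi_n from M = [M_1, ..., M_q] via (11.2.6); Pi[0] is unused."""
    K, q = len(M[0]), len(M)

    def coef(j):
        # M_j := 0 for j > q
        return M[j - 1] if j <= q else zeros(K)

    Pi = [None]
    for i in range(1, n + 1):
        acc = coef(i)
        for j in range(1, i):
            acc = matsub(acc, matmul(Pi[i - j], coef(j)))
        Pi.append(acc)
    return Pi

# Hypothetical bivariate MA(2) coefficients.
M1 = [[0.5, 0.1], [0.0, 0.3]]
M2 = [[0.2, 0.0], [0.1, 0.1]]
Pi = ma_to_var([M1, M2], 6)
for i in range(1, 4):
    print(i, Pi[i])
```

A useful check is that the computed $\Pi_i$ also annihilate the coefficients of $M(L)\Pi(L)$ beyond order zero, that is, $M_k - \Pi_k - \sum_{j=1}^{k-1} M_{k-j}\Pi_j = 0$, because a power series inverse is two-sided.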

The autocovariances of the MA($q$) process (11.2.3) are particularly easy to
obtain. They follow directly from those of an infinite order MA process
given in Chapter 2, Section 2.1.2, (2.1.18):

$$\Gamma_y(h) = E(y_t y_{t-h}') = \begin{cases} \displaystyle\sum_{i=0}^{q-h} M_{i+h} \Sigma_u M_i', & h = 0, 1, \dots, q, \\[2ex] 0, & h = q+1, q+2, \dots, \end{cases} \tag{11.2.7}$$

with $M_0 := I_K$. As before, $\Gamma_y(-h) = \Gamma_y(h)'$. Thus, the vectors
$y_t$ and $y_{t-h}$ are uncorrelated if $h > q$. Obviously, the process
(11.2.3) is stationary because the $\Gamma_y(h)$ do not depend on $t$ and the
mean $E(y_t) = 0$ for all $t$.
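As an illustration, formula (11.2.7) translates almost line by line into code. The helper name `ma_autocov` and the numerical values are assumptions made for this sketch; only $h \geq 0$ is handled, since $\Gamma_y(-h) = \Gamma_y(h)'$.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def ma_autocov(M, sigma_u, h):
    """Gamma_y(h), h >= 0, for the MA(q) process with M = [M_1, ..., M_q]."""
    K, q = len(sigma_u), len(M)
    coefs = [[[float(r == c) for c in range(K)] for r in range(K)]] + M  # M_0 := I_K
    gamma = [[0.0] * K for _ in range(K)]
    if h > q:
        return gamma                      # uncorrelated beyond lag q
    for i in range(q - h + 1):
        term = matmul(matmul(coefs[i + h], sigma_u), transpose(coefs[i]))
        gamma = [[g + t for g, t in zip(rg, rt)] for rg, rt in zip(gamma, term)]
    return gamma

# Hypothetical bivariate MA(1) with Sigma_u = I_2.
M1 = [[0.5, 0.2], [0.0, 0.3]]
sigma_u = [[1.0, 0.0], [0.0, 1.0]]
for h in range(3):
    print(h, ma_autocov([M1], sigma_u, h))
```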

It can be shown that a noninvertible MA($q$) process violating (11.2.5) also
has a pure VAR representation if the determinantal polynomial in (11.2.5)
has no roots on the complex unit circle, i.e., if

$$\det(I_K + M_1 z + \cdots + M_q z^q) \neq 0 \quad\text{for } |z| = 1. \tag{11.2.8}$$

The VAR representation will, however, not be of the type (11.2.4) in that
the white noise process will in general not be the one appearing in
(11.2.3). The reason is that for any noninvertible MA($q$) process satisfying
(11.2.8), there is an equivalent invertible MA($q$) satisfying (11.2.5) which
has an identical autocovariance structure (see Hannan & Deistler (1988,
Chapter 1, Section 3)). For instance, for the univariate MA(1) process

$$y_t = u_t + m u_{t-1}, \tag{11.2.9}$$

the invertibility condition requires that $1 + mz$ has no roots for
$|z| \leq 1$ or, equivalently, $|m| < 1$. For any $m$, the process has
autocovariances

$$E(y_t y_{t-h}) = \begin{cases} (1 + m^2)\sigma_u^2 & \text{for } h = 0, \\ m\sigma_u^2 & \text{for } h = \pm 1, \\ 0 & \text{otherwise}, \end{cases}$$

where $\sigma_u^2 := \mathrm{Var}(u_t)$. It is easy to check that the process
$v_t + \frac{1}{m} v_{t-1}$, where $v_t$ is a white noise process with
$\sigma_v^2 := \mathrm{Var}(v_t) = m^2\sigma_u^2$, has the very same
autocovariance structure. Thus, if $|m| > 1$, we may choose the invertible
MA(1) representation

$$y_t = v_t + \frac{1}{m} v_{t-1} \tag{11.2.10}$$

with

$$v_t = \left(1 + \frac{1}{m} L\right)^{-1} (1 + mL) u_t.$$

The reader is invited to check that $v_t$ is indeed a white noise process
with $\sigma_v^2 = m^2\sigma_u^2$ (see Problem 11.10). Only if $|m| = 1$ and,
hence, $1 + mz = 0$ for some $z$ on the unit circle ($z = 1$ or $-1$), an
invertible representation does not exist.
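The equality of the two autocovariance structures can be confirmed numerically. The sketch below is for the univariate MA(1) case only; the function names and the values $m = 2$, $\sigma_u^2 = 1.5$ are chosen purely for illustration.

```python
def ma1_autocov(m, sigma2):
    """Autocovariances at lags 0 and 1 of e_t + m e_{t-1} with Var(e_t) = sigma2."""
    return (1 + m * m) * sigma2, m * sigma2

def invertible_form(m, sigma2_u):
    """Coefficient and innovation variance of the invertible MA(1) representation."""
    if abs(m) == 1:
        raise ValueError("root on the unit circle: no invertible representation")
    if abs(m) < 1:
        return m, sigma2_u                # already invertible
    return 1.0 / m, m * m * sigma2_u      # flip the root, rescale the variance

m, sigma2_u = 2.0, 1.5
print(ma1_autocov(m, sigma2_u))
print(ma1_autocov(*invertible_form(m, sigma2_u)))
```

Both calls print the same pair of autocovariances, although the second process has coefficient $1/m$ and a larger innovation variance.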

Although for higher order and higher-dimensional processes, where roots
inside and outside the unit circle may exist, it is more complicated to find
the invertible representation, it can be done whenever (11.2.8) is
satisfied. In the remainder of this chapter, we will therefore assume
without notice that all MA processes are invertible unless stated otherwise.
It should be understood that this assumption implies a slight loss of
generality because MA processes with roots on the complex unit circle are
excluded.


11.3  VARMA  Processes 

11.3.1  The  Pure  MA  and  Pure  VAR  Representations  of  a 
VARMA  Process 

As mentioned in the introduction to this chapter, allowing finite order VAR
processes to have finite order MA instead of white noise error terms results
in the broad and flexible class of vector autoregressive moving average
(VARMA) processes. The general form of a process from this class with VAR
order $p$ and MA order $q$ is

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}, \quad t = 0, \pm 1, \pm 2, \dots. \tag{11.3.1}$$

Such a process is briefly called a VARMA($p$, $q$) process. As before, $u_t$
is zero mean white noise with nonsingular covariance matrix $\Sigma_u$.

It may be worth elaborating a bit on this specification. What kind of
process $y_t$ is defined by the VARMA($p$, $q$) model (11.3.1)? To look into
this question, let us denote the MA part by $\varepsilon_t$, that is,
$\varepsilon_t = u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}$ and

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + \varepsilon_t.$$

If this process is stable, that is, if

$$\det(I_K - A_1 z - \cdots - A_p z^p) \neq 0 \quad\text{for } |z| \leq 1, \tag{11.3.2}$$

then, by the same arguments used in Chapter 2, Section 2.1.2, and by
Proposition C.9 of Appendix C.3,

$$\begin{aligned}
y_t &= \mu + \sum_{i=0}^{\infty} D_i \varepsilon_{t-i} \\
&= \mu + \sum_{i=0}^{\infty} D_i (u_{t-i} + M_1 u_{t-i-1} + \cdots + M_q u_{t-i-q}) \\
&= \mu + \sum_{i=0}^{\infty} \Phi_i u_{t-i}
\end{aligned} \tag{11.3.3}$$

is well-defined as a limit in mean square, given a well-defined white noise
process $u_t$. Here

$$\mu := (I_K - A_1 - \cdots - A_p)^{-1} \nu,$$

the $D_i$ are $(K \times K)$ matrices satisfying

$$\sum_{i=0}^{\infty} D_i z^i = (I_K - A_1 z - \cdots - A_p z^p)^{-1},$$

and the $\Phi_i$ are $(K \times K)$ matrices satisfying

$$\sum_{i=0}^{\infty} \Phi_i z^i = \left(\sum_{i=0}^{\infty} D_i z^i\right) (I_K + M_1 z + \cdots + M_q z^q).$$

In the following, when we call $y_t$ a stable VARMA($p$, $q$) process, we
mean the well-defined process given in (11.3.3). For instance, if $u_t$ is
Gaussian white noise, it can be shown that $y_t$ is a Gaussian process with
all finite subcollections of vectors $y_t, \dots, y_{t+h}$ having joint
multivariate normal distributions. The representation (11.3.3) is a pure MA
or simply MA representation of $y_t$.
To make the derivation of the MA representation more transparent, let us
write the process (11.3.1) in lag operator notation,

$$A(L) y_t = \nu + M(L) u_t, \tag{11.3.4}$$

where $A(L) := I_K - A_1 L - \cdots - A_p L^p$ and
$M(L) := I_K + M_1 L + \cdots + M_q L^q$. A pure MA representation of $y_t$ is
obtained by premultiplying with $A(L)^{-1}$,

$$y_t = A(1)^{-1}\nu + A(L)^{-1} M(L) u_t = \mu + \sum_{i=0}^{\infty} \Phi_i u_{t-i}.$$

Hence, multiplying $\sum_{i=0}^{\infty} \Phi_i L^i = A(L)^{-1} M(L)$ from the
left by $A(L)$ gives

$$(I_K - A_1 L - \cdots - A_p L^p) \left(\sum_{i=0}^{\infty} \Phi_i L^i\right) = I_K + \sum_{i=1}^{\infty} \left(\Phi_i - \sum_{j=1}^{i} A_j \Phi_{i-j}\right) L^i = I_K + M_1 L + \cdots + M_q L^q$$

and, thus, comparing coefficients results in

$$M_i = \Phi_i - \sum_{j=1}^{i} A_j \Phi_{i-j}, \quad i = 1, 2, \dots,$$

with $\Phi_0 := I_K$, $A_j := 0$ for $j > p$, and $M_i := 0$ for $i > q$.
Rearranging terms gives

$$\Phi_i = M_i + \sum_{j=1}^{i} A_j \Phi_{i-j}, \quad i = 1, 2, \dots. \tag{11.3.5}$$
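The recursion (11.3.5) can be sketched in a few lines. The function name `varma_to_ma` and the VARMA(1,1) coefficient values are hypothetical; matrices are plain nested lists.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def varma_to_ma(A, M, n):
    """Phi_0..Phi_n via (11.3.5), with A = [A_1..A_p] and M = [M_1..M_q]."""
    K = len((A or M)[0])
    zero = [[0.0] * K for _ in range(K)]

    def coef(mats, i):
        # A_j := 0 for j > p and M_i := 0 for i > q
        return mats[i - 1] if 1 <= i <= len(mats) else zero

    Phi = [[[float(r == c) for c in range(K)] for r in range(K)]]   # Phi_0 := I_K
    for i in range(1, n + 1):
        acc = coef(M, i)
        for j in range(1, i + 1):
            acc = matadd(acc, matmul(coef(A, j), Phi[i - j]))
        Phi.append(acc)
    return Phi

# Hypothetical bivariate VARMA(1,1) coefficients.
A1 = [[0.5, 0.1], [0.0, 0.4]]
M1 = [[0.3, 0.0], [0.2, 0.1]]
Phi = varma_to_ma([A1], [M1], 4)
for i in range(3):
    print(i, Phi[i])
```

For this VARMA(1,1) example the recursion should give $\Phi_1 = A_1 + M_1$ and $\Phi_2 = A_1(A_1 + M_1)$, which is easy to confirm by hand.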


If the MA operator $M(L)$ satisfies the invertibility condition (11.2.5),
then the VARMA process (11.3.4) is called invertible. In that case, it has a
pure VAR representation,

$$y_t - \sum_{i=1}^{\infty} \Pi_i y_{t-i} = M(L)^{-1} A(L) y_t = M(1)^{-1}\nu + u_t,$$

and the $\Pi_i$ matrices are obtained by comparing coefficients in

$$I_K - \sum_{i=1}^{\infty} \Pi_i L^i = M(L)^{-1} A(L).$$

Alternatively, multiplying this expression from the left by $M(L)$ gives

$$(I_K + M_1 L + \cdots + M_q L^q) \left(I_K - \sum_{i=1}^{\infty} \Pi_i L^i\right) = I_K + \sum_{i=1}^{\infty} \left(M_i - \sum_{j=1}^{i} M_{i-j} \Pi_j\right) L^i = I_K - A_1 L - \cdots - A_p L^p,$$

where $M_0 := I_K$ and $M_i := 0$ for $i > q$. Setting $A_i := 0$ for $i > p$
and comparing coefficients gives

$$-A_i = M_i - \sum_{j=1}^{i-1} M_{i-j} \Pi_j - \Pi_i$$

or

$$\Pi_i = A_i + M_i - \sum_{j=1}^{i-1} M_{i-j} \Pi_j \quad\text{for } i = 1, 2, \dots. \tag{11.3.6}$$

As usual, the sum is defined to be zero if the lower bound for the summation
index exceeds its upper bound.
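A short sketch of the recursion (11.3.6) follows. The function name `varma_to_var` and the coefficient values are invented for this example.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def varma_to_var(A, M, n):
    """Pi_1..Pi_n via (11.3.6), with A = [A_1..A_p], M = [M_1..M_q]; Pi[0] is unused."""
    K = len((A or M)[0])
    zero = [[0.0] * K for _ in range(K)]

    def coef(mats, i):
        # A_i := 0 for i > p and M_i := 0 for i > q
        return mats[i - 1] if 1 <= i <= len(mats) else zero

    Pi = [None]
    for i in range(1, n + 1):
        acc = [[a + m for a, m in zip(ra, rm)] for ra, rm in zip(coef(A, i), coef(M, i))]
        for j in range(1, i):             # empty sum when i = 1
            prod = matmul(coef(M, i - j), Pi[j])
            acc = [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(acc, prod)]
        Pi.append(acc)
    return Pi

# Hypothetical bivariate VARMA(1,1) coefficients.
A1 = [[0.5, 0.1], [0.0, 0.4]]
M1 = [[0.3, 0.0], [0.2, 0.1]]
Pi = varma_to_var([A1], [M1], 4)
for i in range(1, 3):
    print(i, Pi[i])
```

For a VARMA(1,1) process the first two matrices should equal $A_1 + M_1$ and $-M_1(A_1 + M_1)$, which provides a quick check of the sign conventions.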

For instance, for the zero mean VARMA(1,1) process

$$y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1}, \tag{11.3.7}$$

we get

$$\begin{aligned}
\Pi_1 &= A_1 + M_1, \\
\Pi_2 &= A_2 + M_2 - M_1 \Pi_1 = -M_1 A_1 - M_1^2, \\
&\ \,\vdots \\
\Pi_i &= (-1)^{i-1} (M_1^i + M_1^{i-1} A_1), \quad i = 1, 2, \dots,
\end{aligned}$$

and the coefficients of the pure MA representation are

$$\begin{aligned}
\Phi_0 &= I_K, \\
\Phi_1 &= M_1 + A_1, \\
\Phi_2 &= M_2 + A_1 \Phi_1 + A_2 \Phi_0 = A_1 M_1 + A_1^2, \\
&\ \,\vdots \\
\Phi_i &= A_1^{i-1} M_1 + A_1^i, \quad i = 1, 2, \dots.
\end{aligned}$$

If $y_t$ is a stable and invertible VARMA process, then the pure MA
representation (11.3.3) is called the canonical or prediction error MA
representation, in accordance with the terminology used in the finite order
VAR case. In addition to the pure MA and VAR representations considered in
this section, a VARMA process also has VAR(1) representations. One such
representation is introduced next.


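11.3.2  A  VAR(1)  Representation  of  a  VARMA  Process

Suppose $y_t$ has the VARMA($p$, $q$) representation (11.3.1). For
simplicity, we assume that its mean is zero and, hence, $\nu = 0$. Let

$$Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ u_t \\ \vdots \\ u_{t-q+1} \end{bmatrix} \ (K(p+q) \times 1), \qquad U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \\ u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

where the upper block of $U_t$ is $(Kp \times 1)$ and the lower block is
$(Kq \times 1)$, and

$$\mathbf{A} := \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \quad [K(p+q) \times K(p+q)],$$

where

$$\mathbf{A}_{11} := \begin{bmatrix} A_1 & \dots & A_{p-1} & A_p \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I_K & 0 \end{bmatrix} \ (Kp \times Kp), \qquad \mathbf{A}_{12} := \begin{bmatrix} M_1 & \dots & M_{q-1} & M_q \\ 0 & \dots & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 0 & \dots & 0 & 0 \end{bmatrix} \ (Kp \times Kq),$$

$$\mathbf{A}_{21} := 0 \ (Kq \times Kp), \qquad \mathbf{A}_{22} := \begin{bmatrix} 0 & \dots & 0 & 0 \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I_K & 0 \end{bmatrix} \ (Kq \times Kq).$$

With this notation, we get the VAR(1) representation of $Y_t$,

$$Y_t = \mathbf{A} Y_{t-1} + U_t. \tag{11.3.8}$$

If the VAR order is zero ($p = 0$), we choose $p = 1$ and set $A_1 = 0$ in
this representation.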

The $K(p+q)$-dimensional VAR(1) process in (11.3.8) is stable if and only
if $y_t$ is stable. This result follows because

$$\det(I_{K(p+q)} - \mathbf{A}z) = \det(I_{Kp} - \mathbf{A}_{11}z) \det(I_{Kq} - \mathbf{A}_{22}z) = \det(I_K - A_1 z - \cdots - A_p z^p). \tag{11.3.9}$$

Here the rules for the determinant of a partitioned matrix from Appendix
A.10 have been used and we have also used that $I_{Kq} - \mathbf{A}_{22}z$ is
a lower triangular matrix with ones on the main diagonal which has
determinant 1. Furthermore,
$\det(I_{Kp} - \mathbf{A}_{11}z) = \det(I_K - A_1 z - \cdots - A_p z^p)$
follows as in Section 2.1.1.

From Chapter 2, we know that if $y_t$ and, hence, $Y_t$ is stable, the latter
process has an MA representation

$$Y_t = \sum_{i=0}^{\infty} \mathbf{A}^i U_{t-i}.$$

Premultiplying by the $(K \times K(p+q))$ matrix
$J := [I_K : 0 : \cdots : 0]$ gives

$$y_t = \sum_{i=0}^{\infty} J\mathbf{A}^i U_{t-i} = \sum_{i=0}^{\infty} J\mathbf{A}^i H J U_{t-i} = \sum_{i=0}^{\infty} J\mathbf{A}^i H u_{t-i} = \sum_{i=0}^{\infty} \Phi_i u_{t-i},$$

where

$$H := \begin{bmatrix} I_K \\ 0 \\ \vdots \\ 0 \\ I_K \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

consists of an upper $(Kp \times K)$ block and a lower $(Kq \times K)$ block.
Thus,

$$\Phi_i = J\mathbf{A}^i H. \tag{11.3.10}$$

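As an example, consider the zero mean VARMA(1,1) process from (11.3.7),

$$y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1}.$$

For this process

$$Y_t = \begin{bmatrix} y_t \\ u_t \end{bmatrix}, \quad \mathbf{A} = \begin{bmatrix} A_1 & M_1 \\ 0 & 0 \end{bmatrix}, \quad U_t = \begin{bmatrix} u_t \\ u_t \end{bmatrix}, \quad J = [I_K : 0] \ (K \times 2K),$$

and

$$H = \begin{bmatrix} I_K \\ I_K \end{bmatrix} \ (2K \times K).$$

Hence,

$$\begin{aligned}
\Phi_0 &= JH = I_K, \\
\Phi_1 &= J\mathbf{A}H = [A_1 : M_1]H = A_1 + M_1, \\
\Phi_2 &= J\mathbf{A}^2H = J \begin{bmatrix} A_1^2 & A_1 M_1 \\ 0 & 0 \end{bmatrix} H = A_1^2 + A_1 M_1, \\
&\ \,\vdots \\
\Phi_i &= J\mathbf{A}^iH = J \begin{bmatrix} A_1^i & A_1^{i-1} M_1 \\ 0 & 0 \end{bmatrix} H = A_1^i + A_1^{i-1} M_1, \quad i = 1, 2, \dots.
\end{aligned} \tag{11.3.11}$$

This, of course, is precisely the same formula obtained from the recursions
in (11.3.5).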
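The companion-form computation $\Phi_i = J\mathbf{A}^iH$ can be sketched for general $p \geq 1$ and $q \geq 1$ as follows. The function names `companion` and `ma_coefficients` and the coefficient values are assumptions of this example; instead of forming $J$ and $H$ explicitly, the code picks the first $K$ rows of $\mathbf{A}^i$ and adds the column blocks that $H$ selects.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def companion(A_list, M_list):
    """Build the K(p+q) x K(p+q) matrix of (11.3.8); needs p >= 1 and q >= 1."""
    K, p, q = len(A_list[0]), len(A_list), len(M_list)
    n = K * (p + q)
    big = [[0.0] * n for _ in range(n)]
    for j, blk in enumerate(A_list + M_list):   # first block row: A_1..A_p, M_1..M_q
        for r in range(K):
            for c in range(K):
                big[r][j * K + c] = blk[r][c]
    for r in range(K, K * p):                   # identity blocks of A_11
        big[r][r - K] = 1.0
    for r in range(K * (p + 1), n):             # identity blocks of A_22
        big[r][r - K] = 1.0
    return big

def ma_coefficients(A_list, M_list, n):
    """Phi_0..Phi_n computed as J A^i H."""
    K, p = len(A_list[0]), len(A_list)
    big = companion(A_list, M_list)
    size = len(big)
    power = [[float(r == c) for c in range(size)] for r in range(size)]   # A^0
    Phi = []
    for _ in range(n + 1):
        # J keeps the first K rows; H adds column block 1 and column block p + 1.
        Phi.append([[power[r][c] + power[r][K * p + c] for c in range(K)]
                    for r in range(K)])
        power = matmul(power, big)
    return Phi

# Hypothetical bivariate VARMA(1,1) coefficients.
A1 = [[0.5, 0.1], [0.0, 0.4]]
M1 = [[0.3, 0.0], [0.2, 0.1]]
Phi = ma_coefficients([A1], [M1], 3)
for i in range(3):
    print(i, Phi[i])
```

For long horizons this route is more expensive than the recursion, because it multiplies $K(p+q)$-dimensional matrices, but it needs no bookkeeping of lag indices.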

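The foregoing method of computing the MA matrices is just another way of
computing the coefficient matrices of the power series

$$I_K + \sum_{i=1}^{\infty} \Phi_i L^i = (I_K - A_1 L - \cdots - A_p L^p)^{-1} (I_K + M_1 L + \cdots + M_q L^q).$$

Therefore, it can just as well be used to compute the $\Pi_i$ coefficient
matrices of the pure VAR representation of a VARMA process. Recall that

$$I_K - \sum_{i=1}^{\infty} \Pi_i L^i = (I_K + M_1 L + \cdots + M_q L^q)^{-1} (I_K - A_1 L - \cdots - A_p L^p).$$

Hence, if we define

$$\mathbf{M} := \begin{bmatrix} \mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{bmatrix}, \tag{11.3.12}$$

where

$$\mathbf{M}_{11} := \begin{bmatrix} -M_1 & \dots & -M_{q-1} & -M_q \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I_K & 0 \end{bmatrix} \ (Kq \times Kq), \qquad \mathbf{M}_{12} := \begin{bmatrix} -A_1 & \dots & -A_{p-1} & -A_p \\ 0 & \dots & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 0 & \dots & 0 & 0 \end{bmatrix} \ (Kq \times Kp),$$

$$\mathbf{M}_{21} := 0 \ (Kp \times Kq), \qquad \mathbf{M}_{22} := \begin{bmatrix} 0 & \dots & 0 & 0 \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I_K & 0 \end{bmatrix} \ (Kp \times Kp),$$

we get

$$-\Pi_i = J\mathbf{M}^i H \tag{11.3.13}$$

with

$$H := \begin{bmatrix} I_K \\ 0 \\ \vdots \\ 0 \\ I_K \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

where the upper block of $H$ is $(Kq \times K)$ and the lower block is
$(Kp \times K)$, and $J := [I_K : 0 : \cdots : 0]$ is the
$(K \times K(p+q))$ matrix used before.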
11.4  The  Autocovariances  and  Autocorrelations  of  a 
VARMA(p,  q)  Process 

For the $K$-dimensional, zero mean, stable VARMA($p$, $q$) process

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}, \tag{11.4.1}$$

the autocovariances can be obtained formally from its pure MA representation
as in Section 2.1.2. For instance, if $y_t$ has the canonical MA
representation

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i},$$

the autocovariance matrices are

$$\Gamma_y(h) := E(y_t y_{t-h}') = \sum_{i=0}^{\infty} \Phi_{h+i} \Sigma_u \Phi_i'.$$

For the actual computation of the autocovariance matrices, the following
approach is more convenient. Postmultiplying (11.4.1) by $y_{t-h}'$ and
taking expectations gives

$$E(y_t y_{t-h}') = A_1 E(y_{t-1} y_{t-h}') + \cdots + A_p E(y_{t-p} y_{t-h}') + E(u_t y_{t-h}') + \cdots + M_q E(u_{t-q} y_{t-h}').$$

From the pure MA representation of the process, it can be seen that
$E(u_t y_s') = 0$ for $s < t$. Hence, we get for $h > q$,

$$\Gamma_y(h) = A_1 \Gamma_y(h-1) + \cdots + A_p \Gamma_y(h-p). \tag{11.4.2}$$

If $p > q$ and $\Gamma_y(0), \dots, \Gamma_y(p-1)$ are available, this
relation can be used to compute the autocovariances recursively for
$h = p, p+1, \dots$.

The initial matrices can be obtained from the VAR(1) representation
(11.3.8), just as in Chapter 2, Section 2.1.4. In that section, we obtained
the relation

$$\Gamma_Y(0) = \mathbf{A}\,\Gamma_Y(0)\,\mathbf{A}' + \Sigma_U \qquad (11.4.3)$$

for the covariance matrix of the VAR(1) process $Y_t$. Here $\Sigma_U = E(U_t U_t')$ is
the covariance matrix of the white noise process in (11.3.8). Applying the vec
operator to (11.4.3) and rearranging terms gives

$$\operatorname{vec} \Gamma_Y(0) = (I_{K^2(p+q)^2} - \mathbf{A} \otimes \mathbf{A})^{-1} \operatorname{vec}(\Sigma_U), \qquad (11.4.4)$$

where the existence of the inverse follows again from the stability of the
process, as in Section 2.1.4, by appealing to the determinantal relation (11.3.9).

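Having computed $\Gamma_Y(0)$ as in (11.4.4), we may collect $\Gamma_y(0), \ldots, \Gamma_y(p-1)$
from

$$\Gamma_Y(0) = \begin{bmatrix} \Gamma_{11}(0) & \Gamma_{12}(0) \\ \Gamma_{12}(0)' & \Gamma_{22}(0) \end{bmatrix},$$

where

$$\Gamma_{11}(0) = \begin{bmatrix}
\Gamma_y(0) & \Gamma_y(1) & \cdots & \Gamma_y(p-1) \\
\Gamma_y(-1) & \Gamma_y(0) & \cdots & \Gamma_y(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
\Gamma_y(-p+1) & \Gamma_y(-p+2) & \cdots & \Gamma_y(0)
\end{bmatrix},$$

$$\Gamma_{12}(0) = \begin{bmatrix}
E(y_t u_t') & E(y_t u_{t-1}') & \cdots & E(y_t u_{t-q+1}') \\
0 & E(y_{t-1} u_{t-1}') & \cdots & E(y_{t-1} u_{t-q+1}') \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & E(y_{t-p+1} u_{t-q+1}')
\end{bmatrix},$$

and

$$\Gamma_{22}(0) = \begin{bmatrix}
\Sigma_u & 0 & \cdots & 0 \\
0 & \Sigma_u & & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma_u
\end{bmatrix}.$$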

As mentioned previously, the recursions (11.4.2) are valid for $h > q$ only.
Thus, this way of computing the autocovariances requires that $p > q$. If the
VAR order is not greater than $q$, it may be increased artificially by adding lags of
$y_t$ with zero coefficient matrices until the VAR order $p$ exceeds the MA order $q$.
Then the aforementioned procedure can be applied. A computationally more
efficient method of computing the autocovariances of a VARMA process is
described by Mittnik (1990).

The autocorrelations of a VARMA($p, q$) process are obtained from its
autocovariances as in Chapter 2, Section 2.1.4. That is,

$$R_y(h) = D^{-1} \Gamma_y(h) D^{-1}, \qquad (11.4.5)$$

where $D$ is a diagonal matrix with the square roots of the diagonal elements
of $\Gamma_y(0)$ on the main diagonal.

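To illustrate the computation of the covariance matrices, we consider the
VARMA(1, 1) process (11.3.7). Because $p = q$, we add a second lag of $y_t$ so
that

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + u_t + M_1 u_{t-1}$$

with $A_2 := 0$. Thus, in this case,

$$Y_t = \begin{bmatrix} y_t \\ y_{t-1} \\ u_t \end{bmatrix}, \qquad
\mathbf{A} = \begin{bmatrix} A_1 & 0 & M_1 \\ I_K & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad
U_t = \begin{bmatrix} u_t \\ 0 \\ u_t \end{bmatrix}, \qquad
\Sigma_U = \begin{bmatrix} \Sigma_u & 0 & \Sigma_u \\ 0 & 0 & 0 \\ \Sigma_u & 0 & \Sigma_u \end{bmatrix}.$$

With this notation, we get from (11.4.4),

$$\operatorname{vec} \begin{bmatrix} \Gamma_y(0) & \Gamma_y(1) & \Sigma_u \\ \Gamma_y(-1) & \Gamma_y(0) & 0 \\ \Sigma_u & 0 & \Sigma_u \end{bmatrix}
= (I_{9K^2} - \mathbf{A} \otimes \mathbf{A})^{-1} \operatorname{vec}(\Sigma_U).$$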



Now, because we have the starting-up matrices $\Gamma_y(0)$ and $\Gamma_y(1)$, the recursions
(11.4.2) may be applied, giving

$$\Gamma_y(h) = A_1 \Gamma_y(h-1) \quad \text{for } h = 2, 3, \ldots.$$

In stating the assumptions for the VARMA($p, q$) process at the beginning
of this section, invertibility has not been mentioned. This is no accident
because this condition is actually not required for computing the autocovariances
of a VARMA($p, q$) process. The same formulas may be used for invertible and
noninvertible processes. On the other hand, the stability condition is essential
here, because it ensures invertibility of the matrix $I - \mathbf{A} \otimes \mathbf{A}$.
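
The VARMA(1, 1) illustration translates almost line by line into code. The sketch below sets up $\mathbf{A}$ and $\Sigma_U$ for $Y_t = (y_t', y_{t-1}', u_t')'$, solves the vec equation (11.4.4), reads $\Gamma_y(0)$ and $\Gamma_y(1)$ off the first block row of $\Gamma_Y(0)$, and then applies the recursion $\Gamma_y(h) = A_1\Gamma_y(h-1)$. The function names are hypothetical, and the code assumes a stable process and `hmax >= 1`.

```python
import numpy as np

def varma11_autocov(A1, M1, Sigma_u, hmax):
    """Gamma_y(0), ..., Gamma_y(hmax) of a stable VARMA(1,1) process."""
    K = A1.shape[0]
    I, Z = np.eye(K), np.zeros((K, K))
    # Companion matrix and noise covariance for Y_t = (y_t', y_{t-1}', u_t')'
    A = np.block([[A1, Z, M1], [I, Z, Z], [Z, Z, Z]])
    Sigma_U = np.block([[Sigma_u, Z, Sigma_u], [Z, Z, Z], [Sigma_u, Z, Sigma_u]])
    n = 3 * K
    # vec Gamma_Y(0) = (I - A kron A)^{-1} vec(Sigma_U), using column-major vec
    vec_gamma = np.linalg.solve(np.eye(n * n) - np.kron(A, A),
                                Sigma_U.reshape(-1, order="F"))
    Gamma_Y0 = vec_gamma.reshape((n, n), order="F")
    gammas = [Gamma_Y0[:K, :K], Gamma_Y0[:K, K:2 * K]]   # Gamma_y(0), Gamma_y(1)
    for _ in range(2, hmax + 1):
        gammas.append(A1 @ gammas[-1])                   # recursion (11.4.2)
    return gammas

def autocorrelations(gammas):
    """R_y(h) = D^{-1} Gamma_y(h) D^{-1} as in (11.4.5)."""
    D_inv = np.diag(1.0 / np.sqrt(np.diag(gammas[0])))
    return [D_inv @ G @ D_inv for G in gammas]
```

For a univariate ARMA(1, 1) process with AR coefficient $a$, MA coefficient $m$ and unit innovation variance, the output can be compared with the textbook closed forms $\gamma(0) = (1 + 2am + m^2)/(1 - a^2)$ and $\gamma(1) = (1 + am)(a + m)/(1 - a^2)$.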


11.5  Forecasting  VARMA  Processes 

Suppose the $K$-dimensional zero mean VARMA($p, q$) process

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q} \qquad (11.5.1)$$

is stable and invertible. As we have seen in Section 11.3.1, it has a pure VAR
representation,

$$y_t = \sum_{i=1}^{\infty} \Pi_i y_{t-i} + u_t, \qquad (11.5.2)$$

and a pure MA representation,

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i}. \qquad (11.5.3)$$

Formulas for optimal forecasts can be given in terms of each of these
representations.

Assuming that $u_t$ is independent white noise and applying the conditional
expectation operator $E_t$, given information up to time $t$, to (11.5.1) gives an
optimal $h$-step forecast

$$y_t(h) = \begin{cases}
A_1 y_t(h-1) + \cdots + A_p y_t(h-p) + M_h u_t + \cdots + M_q u_{t+h-q} & \text{for } h \le q, \\
A_1 y_t(h-1) + \cdots + A_p y_t(h-p) & \text{for } h > q,
\end{cases} \qquad (11.5.4)$$

where, as usual, $y_t(j) := y_{t+j}$ for $j \le 0$. Analogously, we get from (11.5.2),

$$y_t(h) = \sum_{i=1}^{\infty} \Pi_i y_t(h-i), \qquad (11.5.5)$$

and, in Chapter 2, Section 2.2.2, we have seen that the optimal forecast in
terms of the infinite order MA representation is

$$y_t(h) = \sum_{i=h}^{\infty} \Phi_i u_{t+h-i} = \sum_{i=0}^{\infty} \Phi_{h+i} u_{t-i} \qquad (11.5.6)$$

(see (2.2.10)). Although in Chapter 2 this result was derived in the slightly
more special setting of finite order VAR processes, it is not difficult to see that
it carries over to the present situation. All three formulas (11.5.4)-(11.5.6)
result, of course, in equivalent predictors or forecasts. They are different
representations of the linear minimum MSE predictors if $u_t$ is uncorrelated but
not necessarily independent white noise.

A forecasting formula can also be obtained from the VAR(1) representation
(11.3.8) of the VARMA($p, q$) process. From Section 2.2.2, the optimal $h$-step
forecast of a VAR(1) process at origin $t$ is known to be

$$Y_t(h) = \mathbf{A}^h Y_t = \mathbf{A} Y_t(h-1). \qquad (11.5.7)$$

Premultiplying with the $(K \times K(p+q))$ matrix $J := [I_K : 0 : \cdots : 0]$ results
precisely in the recursive relation (11.5.4) (see Problem 11.4).

The forecasts at origin $t$ are based on the information set

$$\Omega_t = \{y_s \mid s \le t\}.$$

This information set has the drawback of being unavailable in practice. Usually
only a finite sample of $y_t$ data is given and, hence, the $u_t$ cannot be determined
exactly. Thus, even if the parameters of the process are known, the prediction
formulas (11.5.4)-(11.5.6) cannot be used. However, the invertibility of the
process implies that the $\Pi_i$ coefficient matrices go to zero exponentially with
increasing $i$ and we have the approximation

$$\sum_{i=1}^{\infty} \Pi_i y_t(h-i) \approx \sum_{i=1}^{n} \Pi_i y_t(h-i)$$

for large $n$. Consequently, in practice, if the information set is

$$\{y_1, \ldots, y_T\} \qquad (11.5.8)$$

and $T$ is large, then the forecast

$$y_T(h) = \sum_{i=1}^{T+h-1} \Pi_i y_T(h-i), \qquad (11.5.9)$$

where $y_T(j) := y_{T+j}$ for $j \le 0$, will be almost identical to the optimal forecast.
For a low order process, as is commonly used in practice, for which the roots
of

$$\det(I_K + M_1 z + \cdots + M_q z^q)$$

are not close to the unit circle, $T > 50$ will usually result in forecasts that
cannot be distinguished from the optimal forecasts. It is worth noting, however,
that the optimal forecasts based on the finite information set (11.5.8) can be
determined. The resulting forecast formulas are, for instance, given by
Brockwell & Davis (1987, Chapter 11, §11.4). A similar problem is not encountered
in forecasting finite order VAR processes because there the optimal forecast
depends on a finite string of past variables only.
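
The quality of the truncated forecast (11.5.9) can be illustrated by simulation. The sketch below generates a VARMA(1, 1) process with a burn-in period, computes the forecast (11.5.4) from the true innovations, and compares it with the truncated forecast (11.5.9), which uses only the observed sample. For a VARMA(1, 1) process the pure VAR coefficients are $\Pi_1 = A_1 + M_1$ and $\Pi_i = -M_1\Pi_{i-1}$ for $i \ge 2$. All function names, the burn-in length, and the Gaussian innovations with identity covariance are assumptions made for the illustration.

```python
import numpy as np

def simulate_varma11(A1, M1, T, burn=200, seed=0):
    """Simulate T observations of a VARMA(1,1) process after a burn-in period."""
    rng = np.random.default_rng(seed)
    K = A1.shape[0]
    u = rng.standard_normal((T + burn, K))
    y = np.zeros((T + burn, K))
    y[0] = u[0]                                   # zero pre-sample values
    for t in range(1, T + burn):
        y[t] = A1 @ y[t - 1] + u[t] + M1 @ u[t - 1]
    return y[burn:], u[burn:]

def forecast_exact(A1, M1, y_T, u_T, h):
    """Recursion (11.5.4) for p = q = 1, using the true innovation u_T."""
    f = A1 @ y_T + M1 @ u_T
    for _ in range(h - 1):
        f = A1 @ f
    return f

def forecast_truncated(A1, M1, y, h):
    """Truncated forecast (11.5.9) based on the sample y_1, ..., y_T only."""
    T = y.shape[0]
    Pi = [A1 + M1]                                # Pi_1
    for _ in range(T + h - 2):
        Pi.append(-M1 @ Pi[-1])                   # Pi_i = -M_1 Pi_{i-1}
    path = {j: y[T - 1 + j] for j in range(1 - T, 1)}   # y_T(j) = y_{T+j}, j <= 0
    for step in range(1, h + 1):
        path[step] = sum(Pi[i - 1] @ path[step - i] for i in range(1, T + step))
    return path[h]
```

With the eigenvalues of $M_1$ well inside the unit circle and $T = 100$, the two forecasts should differ only by an amount far below any practical tolerance.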

In the presently considered theoretical setting, the forecast MSE matrices
are most easily obtained from the representation (11.5.6). The forecast error
is

$$y_{t+h} - y_t(h) = \sum_{i=0}^{h-1} \Phi_i u_{t+h-i}$$

and, hence, the forecast MSE matrix turns out to be

$$\Sigma_y(h) := E[(y_{t+h} - y_t(h))(y_{t+h} - y_t(h))'] = \sum_{i=0}^{h-1} \Phi_i \Sigma_u \Phi_i' \qquad (11.5.10)$$

as in the finite order VAR case. Note, however, that, in the present case, the $M_i$
coefficient matrices enter in computing the $\Phi_i$ matrices. Because the forecasts
are unbiased, that is, the forecast errors have mean zero, the MSE matrix is
the forecast error covariance matrix. Consequently, if the process is Gaussian,
i.e., for all $t$ and $h$, $y_t, \ldots, y_{t+h}$ have a multivariate normal distribution and
also the $u_t$'s are normally distributed, then the forecast errors are normally
distributed,

$$y_{t+h} - y_t(h) \sim \mathcal{N}(0, \Sigma_y(h)). \qquad (11.5.11)$$

This result may be used in the usual fashion in setting up forecast intervals.
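
A minimal sketch of the MSE computation (11.5.10) follows. It obtains the $\Phi_i$ from the recursion $\Phi_0 = I_K$, $\Phi_i = M_i + \sum_{j=1}^{\min(i,p)} A_j\Phi_{i-j}$ (with $M_i = 0$ for $i > q$), which results from comparing coefficients in $A(L)\Phi(L) = M(L)$, and then sums $\Phi_i\Sigma_u\Phi_i'$. The helper names are hypothetical, and the interval function assumes a Gaussian process with `z = 1.96` chosen for a nominal 95% level.

```python
import numpy as np

def ma_matrices(A, M, K, n):
    """Phi_0..Phi_n with Phi_0 = I_K and Phi_i = M_i + sum_j A_j Phi_{i-j}."""
    p, q = len(A), len(M)
    Phi = [np.eye(K)]
    for i in range(1, n + 1):
        P = M[i - 1].copy() if i <= q else np.zeros((K, K))
        for j in range(1, min(i, p) + 1):
            P = P + A[j - 1] @ Phi[i - j]
        Phi.append(P)
    return Phi

def forecast_mse(A, M, Sigma_u, h):
    """Sigma_y(h) = sum_{i=0}^{h-1} Phi_i Sigma_u Phi_i' as in (11.5.10)."""
    K = Sigma_u.shape[0]
    return sum(P @ Sigma_u @ P.T for P in ma_matrices(A, M, K, h - 1))

def forecast_interval(point, mse, z=1.96):
    """Componentwise interval point +/- z * sqrt(diag(MSE)) (Gaussian case)."""
    half = z * np.sqrt(np.diag(mse))
    return point - half, point + half
```

For a univariate ARMA(1, 1) process with AR coefficient 0.5, MA coefficient 0.3 and unit innovation variance, the MA weights are 1, 0.8, 0.4, so the forecast MSEs for $h = 1, 2, 3$ are 1, 1.64 and 1.80.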

If a process with nonzero mean vector $\mu$ is considered, the mean vector
may simply be added to the prediction formula for the mean-adjusted process.
For example, if $y_t$ has zero mean and $x_t = y_t + \mu$, then the optimal $h$-step
forecast of $x_t$ is

$$x_t(h) = y_t(h) + \mu.$$

The forecast MSE matrix is not affected, that is, $\Sigma_x(h) = \Sigma_y(h)$.

11.6  Transforming  and  Aggregating  VARMA  Processes 

In practice, the original variables of interest are often transformed before
their generation process is modelled. For example, data are often seasonally
adjusted prior to an analysis. Also, sometimes they are temporally
aggregated. For instance, quarterly data may have been obtained by adding up the
corresponding monthly values or by taking their averages. Moreover,
contemporaneous aggregation over a number of households, regions or sectors of the
economy is quite common. For example, the GNP (gross national product)
value for some period is the sum of private consumption, investment
expenditures, net exports, and government spending for that period. It is often of
interest to see what these transformations do to the generation processes of
the variables in order to assess the consequences of transformations for
forecasting and structural analysis. In the following, we assume that the original
data are generated by a VARMA process and we study the consequences of
linear transformations. These results are of importance because many temporal
as well as contemporaneous aggregation procedures can be represented as
linear transformations.


11.6.1  Linear  Transformations  of  VARMA  Processes 

We shall begin with the result that a linear transformation of a process
possessing an MA($q$) representation gives a process that also has a finite order
MA representation with order not greater than $q$.

Proposition 11.1 (Linear Transformation of an MA(q) Process)

Let $u_t$ be a $K$-dimensional white noise process with nonsingular covariance
matrix $\Sigma_u$ and let

$$y_t = \mu + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}$$

be a $K$-dimensional invertible MA($q$) process. Furthermore, let $F$ be an
$(M \times K)$ matrix of rank $M$. Then the $M$-dimensional process $z_t = F y_t$ has an
invertible MA($\tilde q$) representation,

$$z_t = F\mu + v_t + N_1 v_{t-1} + \cdots + N_{\tilde q} v_{t-\tilde q},$$

where $v_t$ is $M$-dimensional white noise with nonsingular covariance matrix
$\Sigma_v$, the $N_i$ are $(M \times M)$ coefficient matrices and $\tilde q \le q$. ■


We will not give a proof of this result here but refer the reader to
Lütkepohl (1984) or Lütkepohl (1987, Chapter 4). The proposition is certainly not
surprising because considering the autocovariance matrices of $z_t$, it is seen
that

$$\Gamma_z(h) = E[(F y_t - F\mu)(F y_{t-h} - F\mu)'] = F \Gamma_y(h) F'
= \begin{cases}
\displaystyle\sum_{i=0}^{q-h} F M_{i+h} \Sigma_u M_i' F', & h = 0, 1, \ldots, q, \\
0, & h = q+1, q+2, \ldots,
\end{cases}$$

by (11.2.7), where $M_0 := I_K$. Thus, the autocovariances of $z_t$ for lags greater than $q$ are all zero.
This result is a necessary requirement for the proposition to be true. It also
helps to understand that the MA order of $z_t$ may be lower than that of $y_t$
because $\Gamma_z(h) = F\Gamma_y(h)F'$ may be zero even if $\Gamma_y(h)$ is nonzero.
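
The last point can be made concrete with a small numerical sketch. It evaluates $\Gamma_y(h) = \sum_{i=0}^{q-h} M_{i+h}\Sigma_uM_i'$ (with $M_0 = I_K$) for an MA($q$) process and the transformed autocovariances $F\Gamma_y(h)F'$. The function names and the example matrices used with them are hypothetical choices for illustration.

```python
import numpy as np

def ma_autocov(M, Sigma_u, h):
    """Gamma_y(h), h >= 0, of the MA(q) process with coefficients M = [M_1..M_q]."""
    K = Sigma_u.shape[0]
    coeffs = [np.eye(K)] + list(M)        # M_0 = I_K
    q = len(M)
    if h > q:
        return np.zeros((K, K))           # autocovariances vanish beyond lag q
    return sum(coeffs[i + h] @ Sigma_u @ coeffs[i].T for i in range(q - h + 1))

def transformed_autocov(F, M, Sigma_u, h):
    """Gamma_z(h) = F Gamma_y(h) F' for z_t = F y_t."""
    return F @ ma_autocov(M, Sigma_u, h) @ F.T
```

For instance, with $M_1$ having a single nonzero entry 0.5 in position (2, 1), $\Sigma_u = I_2$ and $F = [1, 0]$, $\Gamma_y(1) = M_1$ is nonzero while $F\Gamma_y(1)F' = 0$, so that the first component is white noise although the bivariate process is MA(1).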
The proposition has some interesting implications. As we will see in the
following (Corollary 11.1.1), it implies that a linearly transformed VARMA($p, q$)
process has again a finite order VARMA representation. Thus, the VARMA
class is closed with respect to linear transformations. The same is not true
for the class of finite order VAR processes because, as we will see shortly, a
linearly transformed VAR($p$) process may not admit a finite order VAR
representation. This, of course, is an argument in favor of considering the VARMA
class rather than restricting the analysis to finite order VAR processes.

Corollary 11.1.1

Let $y_t$ be a $K$-dimensional, stable, invertible VARMA($p, q$) process and let
$F$ be an $(M \times K)$ matrix of rank $M$. Then the process $z_t = F y_t$ has a
VARMA($\tilde p, \tilde q$) representation with

$$\tilde p \le Kp$$

and

$$\tilde q \le (K - 1)p + q.$$


Proof: We write the process $y_t$ in lag operator notation as

$$A(L) y_t = M(L) u_t, \qquad (11.6.1)$$

where the mean is set to zero without loss of generality as $y_t$ may represent
deviations from the mean. Premultiplying by the adjoint $A(L)^{adj}$ of $A(L)$
gives

$$|A(L)| y_t = A(L)^{adj} M(L) u_t, \qquad (11.6.2)$$

where $A(L)^{adj} A(L) = |A(L)|$ has been used. It is easy to check that
$|A(z)^{adj}| \ne 0$ for $|z| \le 1$. Thus, (11.6.2) is a stable and invertible VARMA representation
of $y_t$. Premultiplying (11.6.2) with $F$ results in

$$|A(L)| z_t = F A(L)^{adj} M(L) u_t. \qquad (11.6.3)$$

The operator $A(L)^{adj} M(L)$ is easily seen to have degree at most $p(K-1) + q$
and, thus, the right-hand side of (11.6.3) is just a linearly transformed finite
order MA process which, by Proposition 11.1, has an MA($\tilde q$) representation
with

$$\tilde q \le p(K - 1) + q.$$

The degree of the AR operator $|A(L)|$ is at most $Kp$ because the determinant
is just a sum of products involving one operator from each row and each
column of $A(L)$. This proves the corollary. ■
The corollary gives upper bounds for the VARMA orders of a linearly
transformed VARMA process. For instance, if $y_t$ is a VAR($p$) = VARMA($p$, 0)
process, a linear transformation $z_t = F y_t$ has a VARMA($\tilde p, \tilde q$) representation
with $\tilde p \le Kp$ and $\tilde q \le (K-1)p$. For some linear transformations, $\tilde q$ will be zero.
We will see in the following, however, that generally there are transformations
for which the upper bounds for the orders are attained and a representation
with lower orders does not exist. This result implies that a linear
transformation of a finite order VAR($p$) process may not admit a finite order VAR
representation. Specifically, the subprocesses or marginal processes of a
$K$-dimensional process $y_t$ are obtained by using transformation matrices such as
$F = [I_M : 0]$. Hence, a subprocess of a VAR($p$) process may not have a finite
order VAR but just a mixed VARMA representation.

For some transformations the result in Corollary 11.1.1 can, in fact, be
tightened. Generally, tighter bounds for the VARMA orders are available if
$M > 1$, as is seen in the following corollary.

Corollary 11.1.2

Let $y_t$ be a $K$-dimensional, stable, invertible VARMA($p, q$) process and let
$F$ be an $(M \times K)$ matrix of rank $M$. Then the process $z_t = F y_t$ has a
VARMA($\tilde p, \tilde q$) representation with

$$\tilde p \le (K - M + 1)p$$

and

$$\tilde q \le (K - M)p + q.$$


Proof: We first consider the case where $z_t$ is a subprocess of $y_t$ consisting of
the first $M$ components. To treat this case, we denote the first $M$ and last
$K - M$ components of the process $y_t$ by $y_{1t}$ and $y_{2t}$, respectively, and we
partition the VAR and MA operators as well as the white noise process $u_t$
accordingly. Thus, we can write the process as

$$A_{11}(L) y_{1t} + A_{12}(L) y_{2t} = M_{11}(L) u_{1t} + M_{12}(L) u_{2t}, \qquad (11.6.4)$$

$$A_{21}(L) y_{1t} + A_{22}(L) y_{2t} = M_{21}(L) u_{1t} + M_{22}(L) u_{2t}. \qquad (11.6.5)$$

Premultiplying (11.6.5) by the adjoint of $A_{22}(L)$ gives

$$|A_{22}(L)| y_{2t} = -A_{22}(L)^{adj} A_{21}(L) y_{1t} + A_{22}(L)^{adj} M_{21}(L) u_{1t} + A_{22}(L)^{adj} M_{22}(L) u_{2t}. \qquad (11.6.6)$$

Moreover, premultiplying (11.6.4) by $|A_{22}(L)|$, replacing $|A_{22}(L)| y_{2t}$ by the
right-hand side of (11.6.6) and rearranging terms, we get

$$\begin{aligned}
{[}|A_{22}(L)| A_{11}(L) - A_{12}(L) A_{22}(L)^{adj} A_{21}(L){]} y_{1t}
&= [|A_{22}(L)| M_{11}(L) - A_{12}(L) A_{22}(L)^{adj} M_{21}(L)] u_{1t} \\
&\quad + [|A_{22}(L)| M_{12}(L) - A_{12}(L) A_{22}(L)^{adj} M_{22}(L)] u_{2t}.
\end{aligned} \qquad (11.6.7)$$

The VAR part of this representation has order

$$\tilde p \le \max\{(K - M)p + p,\ (K - M - 1)p + p + p\} = (K - M + 1)p$$

and, by Proposition 11.1, the right-hand side of (11.6.7) has an MA
representation with order

$$\tilde q \le \max\{(K - M)p + q,\ p + (K - M - 1)p + q\} = (K - M)p + q.$$

Hence, we have established the corollary for transformations $F = [I_M : 0]$.

For a general $(M \times K)$ transformation matrix $F$ with $\operatorname{rk}(F) = M$, we
choose a $((K - M) \times K)$ matrix $C$ such that the $(K \times K)$ matrix

$$\mathcal{F} := \begin{bmatrix} F \\ C \end{bmatrix}$$

is nonsingular and we consider the process $x_t = \mathcal{F} y_t$. Because nonsingular
transformations do not increase the orders of a VARMA process, $x_t$ also has
a VARMA($p, q$) representation. Now we get the result of the corollary by
considering the transformation $z_t = F y_t = [I_M : 0] x_t$. ■


Other bounds for the VARMA orders than those provided in Corollaries
11.1.1 and 11.1.2 for linearly transformed VARMA processes and bounds for
special linear transformations are given in various articles in the literature. For
further results and references see Lütkepohl (1987, Chapter 4; 1986, Kapitel
2).

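To illustrate Corollaries 11.1.1 and 11.1.2, we consider the bivariate
VAR(1) process

$$\begin{bmatrix} 1 - 0.5L & 0.66L \\ 0.5L & 1 + 0.3L \end{bmatrix}
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}
\quad \text{with } \Sigma_u = I_2. \qquad (11.6.8)$$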


Here $K = 2$, $p = 1$, and $q = 0$. Thus, $z_t = [1, 0] y_t = y_{1t}$ as a univariate
($M = 1$) marginal process has an ARMA representation with orders not
greater than (2, 1). The precise form of the process can be determined with
the help of the representation (11.6.3). Using that representation gives

$$[(1 + 0.3L)(1 - 0.5L) - 0.66 \cdot 0.5 L^2] z_t
= [1, 0] \begin{bmatrix} 1 + 0.3L & -0.66L \\ -0.5L & 1 - 0.5L \end{bmatrix}
\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}
= (1 + 0.3L) u_{1t} - 0.66 L u_{2t}. \qquad (11.6.9)$$
The right-hand side, say $w_{1t}$, is the sum of an MA(1) process and a white
noise process. Thus, by Proposition 11.1, it is known to have an MA(1)
representation, say $w_{1t} = v_{1t} + \gamma v_{1,t-1}$. To determine $\gamma$ and $\sigma_1^2 = \operatorname{Var}(v_{1t})$, we
use

$$E(w_{1t}^2) = E(v_{1t} + \gamma v_{1,t-1})^2 = (1 + \gamma^2)\sigma_1^2
= E[(1 + 0.3L) u_{1t} - 0.66 L u_{2t}]^2 = 1.53$$

and

$$\begin{aligned}
E(w_{1t} w_{1,t-1}) &= E[(v_{1t} + \gamma v_{1,t-1})(v_{1,t-1} + \gamma v_{1,t-2})] = \gamma \sigma_1^2 \\
&= E[((1 + 0.3L) u_{1t} - 0.66 u_{2,t-1})((1 + 0.3L) u_{1,t-1} - 0.66 u_{2,t-2})] \\
&= 0.3.
\end{aligned}$$

Solving this nonlinear system of two equations for $\gamma$ and $\sigma_1^2$ gives

$$\gamma = 0.204 \quad \text{and} \quad \sigma_1^2 = 1.47.$$

Note that we have picked the invertible solution with $|\gamma| < 1$. Thus, from
(11.6.9), we get a marginal process

$$(1 - 0.2L - 0.48L^2) y_{1t} = (1 + 0.204L) v_{1t} \quad \text{with } \sigma_1^2 = 1.47.$$

In other words, $y_{1t}$ has indeed an ARMA(2, 1) representation and it is easy to
check that cancellation of the AR and MA operators is not possible. Hence,
the ARMA orders are minimal in this case.

As another example, consider again the bivariate VAR(1) process (11.6.8)
and suppose we are interested in the process $z_t := y_{1t} + y_{2t}$. Thus, $F = [1, 1]$ is
again a $(1 \times 2)$ vector. Multiplying (11.6.8) by the adjoint of the VAR operator
gives

$$(1 - 0.2L - 0.48L^2) \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \begin{bmatrix} 1 + 0.3L & -0.66L \\ -0.5L & 1 - 0.5L \end{bmatrix}
\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}.$$

Hence, multiplying by $F$ gives

$$(1 - 0.2L - 0.48L^2)(y_{1t} + y_{2t}) = (1 - 0.2L) u_{1t} + (1 - 1.16L) u_{2t}.$$

Using similar arguments as for (11.6.9), it can be shown that the right-hand
side of this expression is a process with MA(1) representation $v_t - 0.504 v_{t-1}$,
where $\sigma_v^2 := \operatorname{Var}(v_t) = 2.70$. Consequently, the process of interest has the
ARMA(2, 1) representation

$$(1 - 0.2L - 0.48L^2) z_t = (1 - 0.504L) v_t \quad \text{with } \sigma_v^2 = 2.70. \qquad (11.6.10)$$
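
The two marginal representations can be reproduced numerically. The sketch below represents the entries of the VAR operator in (11.6.8) as polynomials in $L$, forms the determinant and $F A(L)^{adj}$, computes the variance $c_0$ and first autocovariance $c_1$ of the right-hand side (using $\Sigma_u = I_2$), and then solves $(1 + \gamma^2)\sigma^2 = c_0$, $\gamma\sigma^2 = c_1$ for the invertible root. The function names are hypothetical, and `ma1_from_autocov` assumes $c_1 \ne 0$.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def ma1_from_autocov(c0, c1):
    """Invertible MA(1) parameters (gamma, sigma2) matching c0 and c1 (c1 != 0)."""
    gamma = (c0 - np.sqrt(c0 ** 2 - 4 * c1 ** 2)) / (2 * c1)
    return gamma, c1 / gamma

def _first_two(poly):
    """Coefficients of L^0 and L^1, padded with zeros if necessary."""
    c = np.zeros(2)
    k = min(2, len(poly.coef))
    c[:k] = poly.coef[:k]
    return c

def marginal_arma(F):
    """AR polynomial and MA(1) part of z_t = F y_t for the process (11.6.8)."""
    # Entries of A(L); coefficients are in ascending powers of L
    a11, a12 = P([1, -0.5]), P([0, 0.66])
    a21, a22 = P([0, 0.5]), P([1, 0.3])
    det = a11 * a22 - a12 * a21
    adj = [[a22, -a12], [-a21, a11]]
    # F * adj(A(L)): polynomials multiplying u_1t and u_2t
    b1 = _first_two(F[0] * adj[0][0] + F[1] * adj[1][0])
    b2 = _first_two(F[0] * adj[0][1] + F[1] * adj[1][1])
    c0 = np.sum(b1 ** 2) + np.sum(b2 ** 2)       # variance of the right-hand side
    c1 = b1[0] * b1[1] + b2[0] * b2[1]           # first autocovariance
    gamma, sigma2 = ma1_from_autocov(c0, c1)
    return det.coef, gamma, sigma2
```

Calling `marginal_arma([1, 0])` and `marginal_arma([1, 1])` should return the AR coefficients `[1, -0.2, -0.48]` together with values of $\gamma$ and $\sigma^2$ that agree with those reported in the text up to rounding.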


The  following  result  is  of  interest  if  forecasting  is  the  objective  of  the 
analysis. 
Proposition 11.2 (Forecast Efficiency of Linearly Transformed VARMA
Processes)

Let $y_t$ be a stable, invertible, $K$-dimensional VARMA($p, q$) process, let $F$ be
an $(M \times K)$ matrix of rank $M$, and let $z_t = F y_t$. Furthermore, denote the
MSE matrices of the optimal $h$-step predictors of $y_t$ and $z_t$ by $\Sigma_y(h)$ and
$\Sigma_z(h)$, respectively. Then

$$\Sigma_z(h) - F \Sigma_y(h) F'$$

is positive semidefinite. ■

This result means that $F y_t(h)$ is generally a better predictor of $z_{t+h}$ with
smaller (at least not greater) MSEs than $z_t(h)$. In other words, forecasting
the original process $y_t$ and transforming the forecasts is generally better than
forecasting the transformed process directly. A proof and references for
related results were given by Lütkepohl (1987, Chapter 4). To see the point
more clearly, consider again the example process (11.6.8) and suppose we are
interested in the sum of its components $z_t = y_{1t} + y_{2t}$. Forecasting the
bivariate process one step ahead results in a forecast MSE matrix $\Sigma_y(1) = \Sigma_u = I_2$.
Thus, the corresponding 1-step ahead forecast of $z_t$ has MSE

$$[1, 1]\, \Sigma_y(1) \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 2.$$

In contrast, if a univariate forecast is obtained on the basis of the ARMA(2, 1)
representation (11.6.10), the 1-step ahead forecast MSE becomes $\sigma_v^2 = 2.70$.
Clearly, the latter forecast is inferior in terms of MSE.

Of course, these results hold for VARMA processes for which all the
parameters are known. They do not necessarily carry over to estimated processes, a
case which was also investigated and reviewed by Lütkepohl (1987).
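
The comparison can be extended to longer horizons with a short computation. The sketch below evaluates $F\Sigma_y(h)F'$ for the bivariate process (11.6.8), using $\Phi_i = A_1^i$ and $\Sigma_u = I_2$, and the MSE of the univariate forecast based on the ARMA(2, 1) representation (11.6.10), using its MA weights $\psi_0 = 1$, $\psi_1 = 0.2 - 0.504$, $\psi_j = 0.2\psi_{j-1} + 0.48\psi_{j-2}$. The function names are hypothetical, and the ARMA parameters are the rounded values from the text, so the results are accurate to about two decimals only.

```python
import numpy as np

# VAR(1) coefficient matrix implied by A(L) = I_2 - A_1 L in (11.6.8)
A1 = np.array([[0.5, -0.66], [-0.5, -0.3]])
F = np.array([[1.0, 1.0]])          # z_t = y_1t + y_2t

def mse_from_bivariate(h):
    """F Sigma_y(h) F' with Sigma_y(h) = sum_{i<h} A_1^i A_1^i' (Sigma_u = I_2)."""
    total, Phi = 0.0, np.eye(2)
    for _ in range(h):
        total += (F @ Phi @ Phi.T @ F.T)[0, 0]
        Phi = A1 @ Phi
    return total

def mse_from_univariate(h, ar=(0.2, 0.48), ma=-0.504, sigma2=2.70):
    """sigma2 * sum_{i<h} psi_i^2 for the ARMA(2,1) representation (11.6.10)."""
    psi = [1.0, ar[0] + ma]
    for _ in range(2, h):
        psi.append(ar[0] * psi[-1] + ar[1] * psi[-2])
    return sigma2 * sum(w * w for w in psi[:h])
```

For $h = 1$ the functions return 2 and 2.70, as in the text. For $h = 2$ they give roughly 2.92 and 2.95, so the advantage of forecasting the bivariate process persists but is much smaller.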


11.6.2  Aggregation  of  VARMA  Processes 

There is little to be added to the foregoing results for the case of
contemporaneous aggregation. Suppose $y_t = (y_{1t}, \ldots, y_{Kt})'$ consists of $K$ variables. If
all or some of them are contemporaneously aggregated by taking their sum
or average, this just means that $y_t$ is transformed linearly and the foregoing
results apply directly. In particular, the aggregated process has a finite order
VARMA representation if the original process does. Moreover, if forecasts for
the aggregated variables are desired, it is generally preferable to forecast the
disaggregated process and aggregate the forecasts rather than forecast the
aggregated process directly.

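The foregoing results are also helpful in studying the consequences of
temporal aggregation. Suppose we wish to aggregate the variables $y_t$ generated
by

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + u_t + M_1 u_{t-1}$$

over, say, $m = 3$ subsequent periods. To be able to use the previous framework,
we construct a process in which the $m$ observations belonging to one aggregation
period are stacked in one vector. Defining

$$\mathfrak{y}_\tau := \begin{bmatrix} y_{m(\tau-1)+1} \\ y_{m(\tau-1)+2} \\ y_{m\tau} \end{bmatrix}
\quad \text{and} \quad
\mathfrak{u}_\tau := \begin{bmatrix} u_{m(\tau-1)+1} \\ u_{m(\tau-1)+2} \\ u_{m\tau} \end{bmatrix},$$

we get

$$\mathfrak{A}_0 \mathfrak{y}_\tau = \mathfrak{A}_1 \mathfrak{y}_{\tau-1} + \mathfrak{M}_0 \mathfrak{u}_\tau + \mathfrak{M}_1 \mathfrak{u}_{\tau-1}, \qquad (11.6.11)$$

where $\mathfrak{A}_0$, $\mathfrak{A}_1$, $\mathfrak{M}_0$, and $\mathfrak{M}_1$ have the obvious definitions. This form is a
VARMA(1, 1) representation of the $3K$-dimensional process $\mathfrak{y}_\tau$. Our standard
form of a VARMA(1, 1) process can be obtained from this form by
premultiplying with $\mathfrak{A}_0^{-1}$ and defining $\mathfrak{v}_\tau = \mathfrak{A}_0^{-1} \mathfrak{M}_0 \mathfrak{u}_\tau$ which gives

$$\mathfrak{y}_\tau = \mathfrak{A}_0^{-1} \mathfrak{A}_1 \mathfrak{y}_{\tau-1} + \mathfrak{v}_\tau + \mathfrak{A}_0^{-1} \mathfrak{M}_1 \mathfrak{M}_0^{-1} \mathfrak{A}_0 \mathfrak{v}_{\tau-1}.$$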

Now temporal aggregation over $m = 3$ periods can be represented as a linear
transformation of the process $\mathfrak{y}_\tau$. It is not difficult to see that this
method generalizes for higher order processes and temporal aggregation over
more than three periods. Moreover, different types of temporal aggregation
can be handled. For instance, the aggregate may be the sum of subsequent
values or it may be their average. Furthermore, temporal and
contemporaneous aggregation can be dealt with simultaneously. In all of these cases, the
aggregate has a VARMA representation if the original variables are
generated by a finite order VARMA process and its structure can be studied using
the foregoing framework. Moreover, by Proposition 11.2, if forecasts of the
aggregate are of interest, it is in general preferable to forecast the original
disaggregated process and aggregate the forecasts rather than forecast the
aggregate directly. A detailed discussion of these issues and also of forecasting
with estimated processes can be found in Lütkepohl (1987).
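
To make the stacking device concrete, the following sketch assumes, purely for illustration, that the disaggregate process is a VARMA(2, 1) process $y_t = A_1y_{t-1} + A_2y_{t-2} + u_t + M_1u_{t-1}$ and that $m = 3$. Under that assumption the block matrices of the stacked form are obtained by writing down the three equations of one aggregation period. The code builds them, simulates the process, and checks that the stacked relation holds exactly. All function names and the block matrices as coded are assumptions of this sketch rather than definitions quoted from the text.

```python
import numpy as np

def stacked_matrices(A1, A2, M1):
    """Block matrices of the stacked form for an assumed VARMA(2,1) and m = 3."""
    K = A1.shape[0]
    I, Z = np.eye(K), np.zeros((K, K))
    A0s = np.block([[I, Z, Z], [-A1, I, Z], [-A2, -A1, I]])
    A1s = np.block([[Z, A2, A1], [Z, Z, A2], [Z, Z, Z]])
    M0s = np.block([[I, Z, Z], [M1, I, Z], [Z, M1, I]])
    M1s = np.block([[Z, Z, M1], [Z, Z, Z], [Z, Z, Z]])
    return A0s, A1s, M0s, M1s

def simulate_varma21(A1, A2, M1, n_obs, seed=0):
    """Simulate the assumed VARMA(2,1) process with standard normal innovations."""
    rng = np.random.default_rng(seed)
    K = A1.shape[0]
    u = rng.standard_normal((n_obs, K))
    y = np.zeros((n_obs, K))
    for t in range(2, n_obs):
        y[t] = A1 @ y[t - 1] + A2 @ y[t - 2] + u[t] + M1 @ u[t - 1]
    return y, u

def max_stacking_residual(A1, A2, M1, n_blocks=50):
    """Largest absolute deviation from the stacked relation over all blocks."""
    K = A1.shape[0]
    y, u = simulate_varma21(A1, A2, M1, 3 * n_blocks)
    Y = y.reshape(n_blocks, 3 * K)     # row b stacks the three vectors of block b
    U = u.reshape(n_blocks, 3 * K)
    A0s, A1s, M0s, M1s = stacked_matrices(A1, A2, M1)
    res = [A0s @ Y[b] - A1s @ Y[b - 1] - M0s @ U[b] - M1s @ U[b - 1]
           for b in range(1, n_blocks)]
    return np.max(np.abs(res))
```

Temporal aggregation by summation then corresponds to the transformation matrix $F = [I_K : I_K : I_K]$ applied to the stacked vector, and averaging corresponds to $F/3$.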
11.7  Interpretation  of  VARMA  Models


The  same  tools  and  concepts  that  we  have  used  for  interpreting  VAR  models 
may  also  be  applied  in  the  VARMA  case.  We  will  consider  Granger-causality 
and  impulse  response  analysis  in  turn. 

11.7.1  Granger-Causality 

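To study Granger-causality in the context of VARMA processes, we partition
$y_t$ in two groups of variables, $z_t$ and $x_t$, and we partition the VAR and MA
operators as well as the white noise process $u_t$ accordingly. Hence, we get

$$\begin{bmatrix} A_{11}(L) & A_{12}(L) \\ A_{21}(L) & A_{22}(L) \end{bmatrix}
\begin{bmatrix} z_t \\ x_t \end{bmatrix}
= \begin{bmatrix} M_{11}(L) & M_{12}(L) \\ M_{21}(L) & M_{22}(L) \end{bmatrix}
\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}, \qquad (11.7.1)$$

where again a zero mean is assumed for simplicity and without loss of
generality. The results derived in the following are not affected by a nonzero mean
term. The process (11.7.1) is assumed to be stable and invertible and its pure,
canonical MA representation is

$$\begin{bmatrix} z_t \\ x_t \end{bmatrix}
= \begin{bmatrix} \Phi_{11}(L) & \Phi_{12}(L) \\ \Phi_{21}(L) & \Phi_{22}(L) \end{bmatrix}
\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}.$$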

From Proposition 2.2, we know that $x_t$ is not Granger-causal for $z_t$ if and
only if $\Phi_{12}(L) = 0$. Although the proposition is stated for VAR processes, it is
easy to see that it remains correct for the presently considered VARMA case.
We also know that

$$\begin{aligned}
\begin{bmatrix} \Phi_{11}(L) & \Phi_{12}(L) \\ \Phi_{21}(L) & \Phi_{22}(L) \end{bmatrix}
&= \begin{bmatrix} A_{11}(L) & A_{12}(L) \\ A_{21}(L) & A_{22}(L) \end{bmatrix}^{-1}
\begin{bmatrix} M_{11}(L) & M_{12}(L) \\ M_{21}(L) & M_{22}(L) \end{bmatrix} \\
&= \begin{bmatrix}
D(L) & -D(L) A_{12}(L) A_{22}(L)^{-1} \\
-A_{22}(L)^{-1} A_{21}(L) D(L) & A_{22}(L)^{-1} + A_{22}(L)^{-1} A_{21}(L) D(L) A_{12}(L) A_{22}(L)^{-1}
\end{bmatrix} \\
&\quad \times \begin{bmatrix} M_{11}(L) & M_{12}(L) \\ M_{21}(L) & M_{22}(L) \end{bmatrix},
\end{aligned}$$

where

$$D(L) := [A_{11}(L) - A_{12}(L) A_{22}(L)^{-1} A_{21}(L)]^{-1}$$

and the rules for the partitioned inverse have been used (see Appendix A.10).
Consequently, $x_t$ is not Granger-causal for $z_t$ if and only if

$$0 = D(L) M_{12}(L) - D(L) A_{12}(L) A_{22}(L)^{-1} M_{22}(L)$$

or, equivalently,

$$M_{12}(L) - A_{12}(L) A_{22}(L)^{-1} M_{22}(L) = 0.$$

Moreover, it follows as in Proposition 2.3 that there is no instantaneous
causality between $x_t$ and $z_t$ if and only if $E(u_{1t} u_{2t}') = 0$. We state these results as
a proposition.

Proposition 11.3 (Characterization of Noncausality)

Let

$$y_t = \begin{bmatrix} z_t \\ x_t \end{bmatrix}$$

be a stable and invertible VARMA($p, q$) process as in (11.7.1) with possibly
nonzero mean. Then $x_t$ is not Granger-causal for $z_t$ if and only if

$$M_{12}(L) = A_{12}(L) A_{22}(L)^{-1} M_{22}(L). \qquad (11.7.2)$$

There is no instantaneous causality between $z_t$ and $x_t$ if and only if

$$E(u_{1t} u_{2t}') = 0.$$


Remark 1  Obviously, the restrictions characterizing Granger-noncausality
are not quite so easy here as in the VAR($p$) case. Consider, for instance, a
bivariate VARMA(1, 1) process

$$\begin{bmatrix} z_t \\ x_t \end{bmatrix}
= \begin{bmatrix} \alpha_{11,1} & \alpha_{12,1} \\ \alpha_{21,1} & \alpha_{22,1} \end{bmatrix}
\begin{bmatrix} z_{t-1} \\ x_{t-1} \end{bmatrix}
+ \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}
+ \begin{bmatrix} m_{11,1} & m_{12,1} \\ m_{21,1} & m_{22,1} \end{bmatrix}
\begin{bmatrix} u_{1,t-1} \\ u_{2,t-1} \end{bmatrix}.$$

For this process, the restrictions (11.7.2) reduce to

$$m_{12,1} L = (-\alpha_{12,1} L)(1 - \alpha_{22,1} L)^{-1}(1 + m_{22,1} L)$$

or

$$(1 - \alpha_{22,1} L)\, m_{12,1} L = -(1 + m_{22,1} L)\, \alpha_{12,1} L$$

or

$$m_{12,1} = -\alpha_{12,1} \quad \text{and} \quad \alpha_{22,1} m_{12,1} = \alpha_{12,1} m_{22,1}.$$

This, of course, is a set of nonlinear restrictions whereas only linear constraints
were required to characterize Granger-noncausality in the corresponding pure
VAR($p$) case. However, a sufficient condition for (11.7.2) to hold is

$$M_{12}(L) = A_{12}(L) = 0, \qquad (11.7.3)$$

which is again a set of linear constraints. Occasionally, these sufficient
conditions may be easier to test than (11.7.2). ■
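
The restrictions for the bivariate VARMA(1, 1) case are easily checked numerically. The sketch below tests the two conditions $m_{12,1} = -\alpha_{12,1}$ and $\alpha_{22,1}m_{12,1} = \alpha_{12,1}m_{22,1}$ for given coefficient matrices and computes the MA coefficient matrices $\Phi_1 = A_1 + M_1$, $\Phi_i = A_1\Phi_{i-1}$, whose upper right-hand elements must all vanish under Granger-noncausality. The function names and the tolerance are hypothetical choices.

```python
import numpy as np

def noncausality_restrictions_hold(A1, M1, tol=1e-12):
    """Check m_12 = -alpha_12 and alpha_22 m_12 = alpha_12 m_22 (bivariate case)."""
    a12, a22 = A1[0, 1], A1[1, 1]
    m12, m22 = M1[0, 1], M1[1, 1]
    return abs(m12 + a12) < tol and abs(a22 * m12 - a12 * m22) < tol

def ma_coefficients(A1, M1, n):
    """Phi_1..Phi_n of a VARMA(1,1): Phi_1 = A_1 + M_1, Phi_i = A_1 Phi_{i-1}."""
    Phi = [A1 + M1]
    for _ in range(n - 1):
        Phi.append(A1 @ Phi[-1])
    return Phi
```

If the restrictions hold, the (1, 2) elements of all $\Phi_i$ are zero, so that $\Phi_{12}(L) = 0$. If they fail, at least one of these elements is nonzero.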
Remark 2  To turn the arguments put forward prior to Proposition 11.3 into
a formal proof requires that we convince ourselves that all the operations
performed with the matrices of lag polynomials are feasible and correct. Because
we have not proven these results, the arguments should just be taken as an
indication of how a proof may proceed. ■


11.7.2  Impulse  Response  Analysis 

The impulse responses and forecast error variance decompositions of a VARMA
model are obtained from its pure MA representation, as in the finite order
VAR case. Thus, the discussion of Sections 2.3.2 and 2.3.3 carries over to the
present case, except that the $\Phi_i$'s are computed with different formulas. Also,
Propositions 2.4 and 2.5 need modification. We will not give the details here
but refer the reader to the exercises (see Problem 11.9).

It may be worth reiterating some caveats of impulse response analysis
which may be more apparent now after the discussion of transformations in
Section 11.6. In particular, we have seen there that dropping variables
(considering subprocesses) or aggregating the components of a VARMA process
temporally and/or contemporaneously results in possibly quite different VARMA
structures. They will in general have quite different coefficients in their pure
MA representations. In other words, the impulse responses may change
drastically if important variables are excluded from a system or if the level of
aggregation is altered, for instance, if quarterly instead of monthly data are
considered. Again, this does not necessarily render impulse response analysis
useless. It should caution the reader against overinterpreting the evidence
from VARMA models, though. Some thought must be given to the choice of
variables, the level of aggregation, and other transformations of the variables.


11.8  Exercises 

Problem  11.1 

Write the MA(1) process $y_t = u_t + M_1 u_{t-1}$ in VAR(1) form,
$Y_t = \mathbf{A} Y_{t-1} + U_t$, and determine $\mathbf{A}^i$ for $i = 1, 2$.

Problem  11.2 

Suppose $y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1} + M_2 u_{t-2}$ is a stable and invertible
VARMA(1,2) process. Determine the coefficient matrices $\Pi_i$, $i = 1, 2, 3, 4$, of
its pure VAR representation and the coefficient matrices $\Phi_i$, $i = 1, 2, 3, 4$, of
its pure MA representation.

Problem  11.3 

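Evaluate the autocovariances $\Gamma_y(h)$, $h = 1, 2, 3$, of the bivariate VARMA(2,1)
process

$$y_t = \begin{bmatrix} .3 \\ .5 \end{bmatrix} + \begin{bmatrix} .5 & .1 \\ .4 & .5 \end{bmatrix} y_{t-1} + \begin{bmatrix} 0 & 0 \\ .25 & 0 \end{bmatrix} y_{t-2} + u_t + \begin{bmatrix} .6 & .2 \\ 0 & .3 \end{bmatrix} u_{t-1}. \qquad (11.8.1)$$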

(Hint:  The  use  of  a  computer  will  greatly  simplify  this  problem.) 

Problem 11.4

Write the VARMA(1,1) process $y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1}$ in VAR(1) form,
$Y_t = \mathbf{A} Y_{t-1} + U_t$. Determine forecasts $Y_t(h) = \mathbf{A}^h Y_t$ for $h = 1, 2, 3$, and
compare them to forecasts obtained from the recursive formula (11.5.4).

Problem  11.5 

Derive a univariate ARMA representation of the second component, $y_{2t}$, of
the process given in (11.6.8).

Problem  11.6 

Provide upper bounds for the ARMA orders of the process $z_t = y_{1t} + y_{2t} + y_{3t}$,
where $y_t = (y_{1t}, y_{2t}, y_{3t}, y_{4t})'$ is a 4-dimensional VARMA(3,3) process.

Problem  11.7 

Write the VARMA(1,1) process $y_t$ from Problem 11.4 in a form such as
(11.6.11) that permits analyzing temporal aggregation over four periods in the
framework of Section 11.6.2. Give upper bounds for the orders of a VARMA
representation of the process obtained by temporally aggregating $y_t$ over four
periods.

Problem  11.8 

Write down explicitly the restrictions characterizing Granger-noncausality for
a bivariate VARMA(2,1) process. Is $y_{1t}$ Granger-causal for $y_{2t}$ in the process
(11.8.1)?

Problem  11.9 

Generalize Propositions 2.4 and 2.5 to the VARMA($p, q$) case.

(Hint: Show that for a $K$-dimensional VARMA($p, q$) process,

$$\phi_{jk,i} = 0, \quad \text{for } i = 1, 2, \dots,$$

is equivalent to

$$\phi_{jk,i} = 0, \quad \text{for } i = 1, 2, \dots, p(K-1) + q;$$

and

$$\theta_{jk,i} = 0, \quad \text{for } i = 0, 1, 2, \dots,$$

is equivalent to

$$\theta_{jk,i} = 0, \quad \text{for } i = 0, 1, \dots, p(K-1) + q.)$$

Problem 11.10

Suppose that $m$ is a real number with $|m| > 1$ and $u_t$ is a white noise process.
Show that the process

$$v_t = \frac{1 + mL}{1 + \frac{1}{m}L}\, u_t$$

is also white noise with $\mathrm{Var}(v_t) = m^2\,\mathrm{Var}(u_t)$.


12 


Estimation  of  VARMA  Models 


In this chapter, maximum likelihood estimation of the coefficients of a VARMA
model is considered. Before we can proceed to the actual estimation, a unique
set of parameters must be specified. In this context, the problem of
nonuniqueness of a VARMA representation becomes important. This identification
problem, that is, the problem of identifying a unique structure among many
equivalent ones, is treated in Section 12.1. In Section 12.2, the Gaussian likelihood
function of a VARMA model is considered. A numerical algorithm for
maximizing it and, thus, for computing the actual estimates is discussed in Section
12.3. The asymptotic properties of the ML estimators are the subject of
Section 12.4. Forecasting with estimated processes and impulse response analysis
are dealt with in Sections 12.5 and 12.6, respectively.


12.1  The  Identification  Problem 

12.1.1  Nonuniqueness  of  VARMA  Representations 

In the previous chapter, we have considered $K$-dimensional, stationary
processes $y_t$ with VARMA($p, q$) representations

$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}. \qquad (12.1.1)$$

Because the mean term is of no importance for the presently considered
problem, we have set it to zero. Therefore, no intercept term appears in (12.1.1).
This model can be written in lag operator notation as

$$A(L) y_t = M(L) u_t, \qquad (12.1.2)$$

where $A(L) := I_K - A_1 L - \cdots - A_p L^p$ and $M(L) := I_K + M_1 L + \cdots + M_q L^q$.

Assuming that the VARMA representation is stable and invertible, the
well-defined process described by the model (12.1.1) or (12.1.2) is given by

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i} = \Phi(L) u_t = A(L)^{-1} M(L) u_t.$$

In practice, it is sometimes useful to consider a slightly more general type
of VARMA model by attaching nonidentity coefficient matrices to $y_t$ and $u_t$,
that is, one may want to consider representations of the type

$$A_0 y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + M_0 v_t + M_1 v_{t-1} + \cdots + M_q v_{t-q}, \qquad (12.1.3)$$

where $v_t$ is a suitable white noise process. Such a form may be suggested by
subject matter theory which may imply instantaneous effects of some variables
on other variables. It will also turn out to be useful in finding unique structures
for VARMA models. By the specification (12.1.3) we mean the well-defined
process

$$y_t = (A_0 - A_1 L - \cdots - A_p L^p)^{-1} (M_0 + M_1 L + \cdots + M_q L^q) v_t.$$

Such a process has a standard VARMA($p, q$) representation with identity
coefficient matrices attached to the instantaneous $y_t$ and $u_t$ if $A_0$ and $M_0$ are
nonsingular. To see this, we premultiply (12.1.3) by $A_0^{-1}$ and define $u_t = A_0^{-1} M_0 v_t$,
which gives

$$y_t = A_0^{-1} A_1 y_{t-1} + \cdots + A_0^{-1} A_p y_{t-p} + u_t + A_0^{-1} M_1 M_0^{-1} A_0 u_{t-1} + \cdots + A_0^{-1} M_q M_0^{-1} A_0 u_{t-q}.$$

Redefining the matrices appropriately, this, of course, is a representation of
the type (12.1.1) with identity coefficient matrices at lag zero which describes
the same process as (12.1.3). The assumption that both $A_0$ and $M_0$ are
nonsingular does not entail any loss of generality, as long as none of the components
of $y_t$ can be written as a linear combination of the other components. We call
a stable and invertible representation as in (12.1.1) a VARMA representation
in standard form or a standard VARMA representation to distinguish it from
representations with nonidentity matrices at lag zero as in (12.1.3). This
discussion shows that VARMA representations are not unique, that is, a given
process $y_t$ can be written in standard form or in nonstandard form by
premultiplying by any nonsingular $(K \times K)$ matrix. We have encountered a similar
problem in dealing with finite order structural VAR processes in Chapter 9.
However, once we consider standard reduced form VAR models only, we have
unique representations. This property is in sharp contrast to the presently
considered VARMA case, where, in general, a standard form is not a unique
representation, as we will see shortly.

It may be useful at this stage to emphasize what we mean by equivalent
representations of a process. Generally, two representations of a process $y_t$
are equivalent if they give rise to the same realizations (except on a set of
measure zero) and, thus, to the same multivariate distributions of any finite
subcollection of variables $y_t, y_{t+1}, \dots, y_{t+h}$, for arbitrary integers $t$ and $h$.
Of course, this specification just says that equivalent representations really
represent the same process. If $y_t$ is a zero mean process with canonical MA
representation

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i} = \Phi(L) u_t, \qquad \Phi_0 = I_K, \qquad (12.1.4)$$

where $\Phi(L) := \sum_{i=0}^{\infty} \Phi_i L^i$, then any VARMA model $A(L) y_t = M(L) u_t$ for
which

$$A(L)^{-1} M(L) = \Phi(L) \qquad (12.1.5)$$

is an equivalent representation of the process $y_t$. In other words, all VARMA
models are equivalent for which $A(L)^{-1} M(L)$ results in the same operator
$\Phi(L)$. Thus, in order to ensure uniqueness of a VARMA representation, we
must impose restrictions on the VAR and MA operators such that there is
precisely one feasible pair of operators $A(L)$ and $M(L)$ satisfying (12.1.5) for
a given $\Phi(L)$.

Obviously, given some stable, invertible VARMA representation $A(L) y_t = M(L) u_t$,
an equivalent representation results if we premultiply by any
nonsingular matrix $A_0$. Therefore, to remove this source of nonuniqueness, let us
for the moment focus on VARMA representations in standard form. As
mentioned earlier, even then uniqueness is not ensured. To see this problem more
clearly, let us consider a bivariate VARMA(1,1) process in standard form,

$$y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1}. \qquad (12.1.6)$$

From Section 11.3.1, we know that this process has the canonical MA
representation

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i} = u_t + \sum_{i=1}^{\infty} (A_1^i + A_1^{i-1} M_1) u_{t-i}. \qquad (12.1.7)$$

Thus, for example, any VARMA(1,1) representation with $M_1 = -A_1$ will
result in the same canonical MA representation. In other words, if it turns
out that $y_t$ is such that $M_1 = -A_1$ for some set of coefficients, then any
choice of $A_1$ matrix that gives rise to a stable VAR operator can be matched
by an $M_1$ matrix that leads to an equivalent VARMA(1,1) representation
of $y_t$. Of course, in this case, the MA coefficient matrices in (12.1.7) are
in fact all zero and $y_t = u_t$ is really white noise, that is, $y_t$ actually has
a VARMA(0,0) structure. This fact is also quite easy to see from the lag
operator representation of (12.1.6),

$$(I_2 - A_1 L) y_t = (I_2 + M_1 L) u_t.$$

Of course, if $M_1 = -A_1$, the MA operator cancels against the VAR operator.
This type of parameter indeterminacy is also known from univariate ARMA
processes. It is usually ruled out by the assumption that the AR and MA
operators have no common factors. Let us make a similar assumption in the
presently considered multivariate case by requiring that $y_t$ is not white noise,
i.e., $M_1 \neq -A_1$.

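Unfortunately, in the multivariate case, the nonuniqueness problem is not
solved by this assumption. To see this, suppose that

$$A_1 = \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad M_1 = 0,$$

where $\alpha \neq 0$. In this case, the canonical MA representation (12.1.4) has
coefficient matrices

$$\Phi_1 = A_1, \quad \Phi_2 = \Phi_3 = \cdots = 0, \qquad (12.1.8)$$

because $A_1^i = 0$ for $i > 1$. The same MA representation results if

$$A_1 = 0 \quad \text{and} \quad M_1 = \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix}.$$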


More generally, a canonical MA representation with coefficient matrices as in
(12.1.8) is obtained if

$$A_1 = \begin{bmatrix} 0 & \alpha + m \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad M_1 = \begin{bmatrix} 0 & -m \\ 0 & 0 \end{bmatrix},$$

whatever the value of $m$. Note also that the VARMA representation will be
stable and invertible for any value of $m$.
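The indeterminacy can be checked numerically. The following Python sketch is only an illustration under our own assumptions: the helper names are hypothetical, $\alpha = 1/2$ is an arbitrary choice, and exact rational arithmetic is used to avoid rounding effects. It computes $\Phi_1 = A_1 + M_1$ and $\Phi_i = A_1 \Phi_{i-1}$ for $i > 1$, as implied by (12.1.7), and confirms that the result does not depend on $m$.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matadd(a, b):
    """Add two matrices elementwise."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ma_coefficients(a1, m1, horizon):
    """Return [Phi_1, ..., Phi_horizon] of y_t = A1 y_{t-1} + u_t + M1 u_{t-1}."""
    phi = [matadd(a1, m1)]               # Phi_1 = A1 + M1
    for _ in range(horizon - 1):
        phi.append(matmul(a1, phi[-1]))  # Phi_i = A1 Phi_{i-1}
    return phi


alpha = Fraction(1, 2)  # illustrative value, not taken from the text


def pair(m):
    """(A1, M1) with the arbitrary constant m shifted between the two."""
    return [[0, alpha + m], [0, 0]], [[0, -m], [0, 0]]


reference = ma_coefficients(*pair(Fraction(0)), horizon=4)
same_for_all_m = all(
    ma_coefficients(*pair(Fraction(m)), horizon=4) == reference
    for m in (-2, 1, 7)
)
```

Here `same_for_all_m` evaluates to `True`, so the sample moments of the data cannot distinguish between these parameter pairs.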

To understand where the parameter indeterminacy comes from, consider
the VAR operator

$$I_2 - \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} L. \qquad (12.1.9)$$

The inverse of this operator is

$$I_2 + \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} L, \qquad (12.1.10)$$

which is easily checked by multiplying the two operators together. Thus, the
operator (12.1.9) has a finite order inverse. Operators of this type are precisely
the ones that cause trouble in setting up a uniquely parameterized VARMA
representation of a given process because multiplying by such an operator
may cancel part of one operator (VAR or MA) while at the same time the
finite order of the other operator is maintained.
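The multiplication of the two operators can be sketched in a few lines of Python. This is a minimal illustration, not a general-purpose routine: a matrix lag polynomial is represented (by our own convention) as a list of coefficient matrices `[C0, C1, ...]`, and the value of $\alpha$ is an arbitrary assumption.

```python
from fractions import Fraction


def poly_matmul(p, q):
    """Product of two matrix lag polynomials given as lists [C0, C1, ...]."""
    n = len(p[0])
    out = [[[0] * n for _ in range(n)] for _ in range(len(p) + len(q) - 1)]
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            for r in range(n):
                for c in range(n):
                    # Coefficient of L^(i+j) collects the product a * b.
                    out[i + j][r][c] += sum(a[r][k] * b[k][c] for k in range(n))
    return out


alpha = Fraction(3, 4)                          # illustrative value
identity = [[1, 0], [0, 1]]
operator = [identity, [[0, -alpha], [0, 0]]]    # I_2 - [0 alpha; 0 0] L, cf. (12.1.9)
inverse = [identity, [[0, alpha], [0, 0]]]      # I_2 + [0 alpha; 0 0] L, cf. (12.1.10)

product = poly_matmul(operator, inverse)
```

The coefficient of $L^0$ in `product` is $I_2$ and the coefficients of $L$ and $L^2$ are zero matrices, so the product is the identity operator.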

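To get a better sense for this problem, let us look at the following
VARMA(1,1) process:

$$A(L) y_t = M(L) u_t,$$

where

$$A(L) := \begin{bmatrix} 1 - \alpha_{11} L & -\alpha_{12} L \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad M(L) := \begin{bmatrix} 1 + m_{11} L & m_{12} L \\ 0 & 1 \end{bmatrix}.$$

The two operators do not cancel if $\alpha_{11} \neq -m_{11}$ and $\alpha_{12} \neq -m_{12}$. Still we can
factor an operator

$$D(L) = \begin{bmatrix} 1 & \gamma L \\ 0 & 1 \end{bmatrix}$$

from both operators without changing their general structure:

$$A(L) = D(L) \begin{bmatrix} 1 - \alpha_{11} L & -(\gamma + \alpha_{12}) L \\ 0 & 1 \end{bmatrix},$$

$$M(L) = D(L) \begin{bmatrix} 1 + m_{11} L & (m_{12} - \gamma) L \\ 0 & 1 \end{bmatrix}.$$

Cancelling $D(L)$ gives operators

$$\begin{bmatrix} 1 - \alpha_{11} L & -(\gamma + \alpha_{12}) L \\ 0 & 1 \end{bmatrix} = D(L) \begin{bmatrix} 1 - \alpha_{11} L & -(2\gamma + \alpha_{12}) L \\ 0 & 1 \end{bmatrix}$$

and

$$\begin{bmatrix} 1 + m_{11} L & (m_{12} - \gamma) L \\ 0 & 1 \end{bmatrix} = D(L) \begin{bmatrix} 1 + m_{11} L & (m_{12} - 2\gamma) L \\ 0 & 1 \end{bmatrix}.$$

Thus, we can again factor and cancel $D(L)$. In fact, we can cancel $D(L)$ as
often as we like without changing the general structure of the process. Hence,
even if the orders of both operators cannot be reduced simultaneously by
cancellation, it may still be possible to factor some operator from both $A(L)$
and $M(L)$ without changing their general structure. Note that the troubling
operator $D(L)$ is again one with finite order inverse,

$$D(L)^{-1} = \begin{bmatrix} 1 & -\gamma L \\ 0 & 1 \end{bmatrix}.$$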

Finite order operators that have a finite order inverse are characterized by
the property that their determinant is a nonzero constant, that is, it does not
involve $L$ or powers of $L$. Operators with this property are called unimodular.
For instance, the operator (12.1.9) has determinant

$$\left| I_2 - \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} L \right| = \begin{vmatrix} 1 & -\alpha L \\ 0 & 1 \end{vmatrix} = 1$$

and, hence, it is unimodular. The property of a unimodular operator to have
a finite order inverse follows because the inverse of an operator $A(L)$ is its
adjoint divided by its determinant,

$$A(L)^{-1} = A(L)^{adj} / |A(L)| = |A(L)|^{-1} A(L)^{adj}.$$

The determinant is a univariate operator. A finite order invertible univariate
operator, however, has an infinite order inverse, unless its degree is zero, that
is, unless it is a constant.

In order to state uniqueness conditions for a VARMA representation, we
will first of all require that a representation is chosen for which further
cancellation is not possible in the sense that there are no common factors in the
VAR and MA parts, except for unimodular operators. Operators $A(L)$ and
$M(L)$ with this property are left-coprime. This property may be defined by
calling the matrix operator $[A(L) : M(L)]$ left-coprime, if the existence of
operators $D(L)$, $\bar{A}(L)$, and $\bar{M}(L)$ satisfying

$$D(L) [\bar{A}(L) : \bar{M}(L)] = [A(L) : M(L)] \qquad (12.1.11)$$

implies that $D(L)$ is unimodular, that is, $|D(L)|$ is a nonzero constant. From
the foregoing examples, it should be understood that in general factoring
unimodular operators from $A(L)$ and $M(L)$ is unavoidable if no further
constraints are imposed. Thus, to obtain uniqueness of left-coprime operators we
have to impose restrictions ensuring that the only feasible unimodular
operator $D(L)$ in (12.1.11) is $D(L) = I_K$. We will now give two sets of conditions
that ensure uniqueness of a VARMA representation.


12.1.2  Final  Equations  Form  and  Echelon  Form 


Suppose $y_t$ is a stationary zero mean process that has a stable, invertible
VARMA representation,

$$A(L) y_t = M(L) u_t, \qquad (12.1.12)$$

where $A(L) := A_0 - A_1 L - \cdots - A_p L^p$ and $M(L) := M_0 + M_1 L + \cdots + M_q L^q$.

Further suppose that $A(L)$ and $M(L)$ are left-coprime and the white noise
covariance matrix $\Sigma_u$ is nonsingular.


Definition 12.1 (Final Equations Form)

The VARMA representation (12.1.12) is said to be in final equations form if
$M_0 = I_K$ and $A(L) = \alpha(L) I_K$, where $\alpha(L) := 1 - \alpha_1 L - \cdots - \alpha_p L^p$ is a scalar
(one-dimensional) operator with $\alpha_p \neq 0$. ■

For instance, the bivariate VARMA(3,1) model

$$(1 - \alpha_1 L - \alpha_2 L^2 - \alpha_3 L^3) \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} 1 + m_{11,1} L & m_{12,1} L \\ m_{21,1} L & 1 + m_{22,1} L \end{bmatrix} \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} \qquad (12.1.13)$$

with $\alpha_3 \neq 0$, is in final equations form. The label “final equations form” for
this type of VARMA representation is in line with the terminology used in
Chapter 10, Section 10.2.2.

Uniqueness of the final equations form

$$\alpha(L) y_t = M(L) u_t$$

is seen by noting that $D(L) = I_K$ is the only operator that retains the scalar
AR part upon multiplication. For the operator $D(L) \alpha(L) I_K$ to maintain the
order $p$, the operator $D(L)$ must have degree zero, that is, $D(L) = D$.
However, the only possible matrix $D$ that guarantees a zero order matrix $I_K$ for
the VAR operator is $D = I_K$.

Definition 12.2 (Echelon Form)

The VARMA representation (12.1.12) is said to be in echelon form or ARMA$_E$
form if the VAR and MA operators $A(L) = [\alpha_{ki}(L)]_{k,i=1,\dots,K}$ and
$M(L) = [m_{ki}(L)]_{k,i=1,\dots,K}$ are left-coprime and satisfy the following conditions: The operators
$\alpha_{ki}(L)$ $(i = 1, \dots, K)$ and $m_{kj}(L)$ $(j = 1, \dots, K)$ in the $k$-th row of $A(L)$ and
$M(L)$ have degree $p_k$ and they have the form

$$\alpha_{kk}(L) = 1 - \sum_{j=1}^{p_k} \alpha_{kk,j} L^j, \quad \text{for } k = 1, \dots, K,$$

$$\alpha_{ki}(L) = -\sum_{j=p_k - p_{ki} + 1}^{p_k} \alpha_{ki,j} L^j, \quad \text{for } k \neq i,$$

and

$$m_{ki}(L) = \sum_{j=0}^{p_k} m_{ki,j} L^j, \quad \text{for } k, i = 1, \dots, K, \text{ with } M_0 = A_0.$$

In the VAR operators $\alpha_{ki}(L)$,

$$p_{ki} := \begin{cases} \min(p_k + 1, p_i) & \text{for } k \geq i, \\ \min(p_k, p_i) & \text{for } k < i, \end{cases} \qquad k, i = 1, \dots, K. \qquad (12.1.14)$$

That is, $p_{ki}$ specifies the number of free coefficients in the operator $\alpha_{ki}(L)$
for $i \neq k$. The row degrees $(p_1, \dots, p_K)$ are called the Kronecker indices and
their sum $\sum_{k=1}^{K} p_k$ is the McMillan degree. Obviously, for the VARMA orders
we have, in general, $p = q = \max(p_1, \dots, p_K)$. ■


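We will sometimes denote an echelon form VARMA model with Kronecker
indices $(p_1, \dots, p_K)$ by ARMA$_E(p_1, \dots, p_K)$. The following model is an
example of a bivariate VARMA process in echelon form or, more precisely, an
ARMA$_E(2, 1)$:

$$\begin{bmatrix} 1 - \alpha_{11,1} L - \alpha_{11,2} L^2 & -\alpha_{12,2} L^2 \\ -\alpha_{21,0} - \alpha_{21,1} L & 1 - \alpha_{22,1} L \end{bmatrix} \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} 1 + m_{11,1} L + m_{11,2} L^2 & m_{12,1} L + m_{12,2} L^2 \\ -\alpha_{21,0} + m_{21,1} L & 1 + m_{22,1} L \end{bmatrix} \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} \qquad (12.1.15)$$

or

$$\begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} \alpha_{11,1} & 0 \\ \alpha_{21,1} & \alpha_{22,1} \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} \alpha_{11,2} & \alpha_{12,2} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y_{1,t-2} \\ y_{2,t-2} \end{bmatrix}$$

$$\qquad + \begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} + \begin{bmatrix} m_{11,1} & m_{12,1} \\ m_{21,1} & m_{22,1} \end{bmatrix} \begin{bmatrix} u_{1,t-1} \\ u_{2,t-1} \end{bmatrix} + \begin{bmatrix} m_{11,2} & m_{12,2} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} u_{1,t-2} \\ u_{2,t-2} \end{bmatrix}.$$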

In this model, the Kronecker indices (row degrees) are $p_1 = 2$ and $p_2 = 1$.
Thus, the McMillan degree is 3. The $p_{ki}$ numbers are

$$\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}$$

(see (12.1.14)). The off-diagonal elements $p_{12}$ and $p_{21}$ of this matrix indicate
the numbers of parameters contained in the operators $\alpha_{12}(L)$ and $\alpha_{21}(L)$,
respectively. Because $\alpha_{12}(L)$ belongs to the first row or first equation of the
system, it has degree $p_1 = 2$. Hence, because it has just one free coefficient
($p_{12} = 1$), it has the form $\alpha_{12}(L) = -\alpha_{12,2} L^2$. Similarly, $\alpha_{21}(L)$ belongs to the
second row of the system and, thus, it has degree $p_2 = 1$. Because it has $p_{21} = 2$
free coefficients, it must be of the form $\alpha_{21}(L) = -\alpha_{21,0} - \alpha_{21,1} L$. Another
characteristic feature of the echelon form is that $A_0$ is lower-triangular and
has ones on the main diagonal. Moreover, the zero order MA coefficient matrix
is identical to the zero order VAR matrix, $M_0 = A_0$.
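The $p_{ki}$ numbers are easily tabulated from the Kronecker indices. The following Python sketch is an illustration only; the function name is ours, and we assume that the first branch of (12.1.14) also covers the diagonal case $k = i$, which is what the matrix above implies.

```python
def p_matrix(kronecker):
    """Return the matrix [p_ki] implied by (12.1.14).

    Assumption: the case k == i is treated like k > i, that is,
    p_ki = min(p_k + 1, p_i), so that p_kk = p_k.
    """
    size = len(kronecker)
    return [
        [
            min(kronecker[k] + 1, kronecker[i]) if k >= i
            else min(kronecker[k], kronecker[i])
            for i in range(size)
        ]
        for k in range(size)
    ]


# Kronecker indices (p_1, p_2) = (2, 1) of the example process (12.1.15).
example = p_matrix([2, 1])
```

For the indices $(2, 1)$ the function reproduces the matrix with rows $(2, 1)$ and $(2, 1)$ given above.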

Some free coefficients of the echelon form of a VARMA model may be
zero and, hence, $p$ or $q$ may be less than $\max(p_1, \dots, p_K)$. For instance, in
the example process (12.1.15), $m_{11,2}$ and $m_{12,2}$ may be zero. In that case,
$q = 1 < \max(p_1, p_2) = 2$. In order for a representation to be an echelon form
with Kronecker indices $(p_1, \dots, p_K)$, at least one operator in the $k$-th row of
$[A(L) : M(L)]$ must have degree $p_k$, with nonzero coefficient at lag $p_k$.

An echelon is a certain positioning of an army in the form of steps.
Similarly, the nonzero parameters in an echelon VARMA representation are
positioned in a specific way. In particular, the positioning of freely varying
parameters in the $k$-th equation depends only on Kronecker indices $p_i \leq p_k$ and
not on Kronecker indices $p_j > p_k$. More precisely, as long as $p_j > p_k$, the
positioning of the free parameters in the $k$-th equation will be the same for
any value $p_j$. For the example process (12.1.15), it is easy to check that the
positions of the free parameters in the second equation will remain the same
if the row degree of the first equation is increased to $p_1 = 3$. In other words,
$p_{21}$ does not change due to an increase in $p_1$.

It can be shown that the echelon form, just like the final equations form,
guarantees uniqueness of the VARMA representation. In other words, if a
VARMA representation is in echelon form, then the representation is unique
within the class of all echelon representations. A similar statement applies
for the final equations form. Also, for any stable, invertible VARMA($p, q$)
representation, there exists an equivalent echelon form and an equivalent final
equations form.

The  reader  may  wonder  why  we  consider  the  complicated  looking  echelon 
representation  although  the  final  equations  form  serves  the  same  purpose.  The 
reason  is  that  the  echelon  form  is  usually  preferable  in  practice  because  it  often 
involves  fewer  free  parameters  than  the  equivalent  final  equations  form.  We 
will  see  an  example  of  this  phenomenon  shortly.  Having  as  few  free  parameters 
as  possible  is  important  to  ease  the  numerical  problems  in  maximizing  the 
likelihood  function  and  to  gain  efficiency  of  the  parameter  estimators. 

There are a number of other unique or identified parameterizations of
VARMA models. We have chosen to present the final equations form and the
echelon form because these two forms will play a role when we discuss the issue
of specifying VARMA models in Chapter 13. For proofs of the uniqueness of
the echelon form and for other identification conditions we refer to Hannan
(1969, 1970, 1976, 1979), Deistler & Hannan (1981), and Hannan & Deistler
(1988). We now proceed with illustrations of the final equations form and the
echelon form.


12.1.3  Illustrations 


Starting from some VARMA($p, q$) representation $A(L) y_t = M(L) u_t$, one
strategy for finding the corresponding final equations form results from
premultiplying with the adjoint $A(L)^{adj}$ of the VAR operator $A(L)$ which gives

$$|A(L)| y_t = A(L)^{adj} M(L) u_t, \qquad (12.1.16)$$

where $A(L)^{adj} A(L) = |A(L)|$ has been used. Obviously, (12.1.16) has a scalar
VAR operator and, hence, is in final equations form if all superfluous terms
are cancelled.
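For a bivariate operator the determinant and adjoint in (12.1.16) can be computed with elementary polynomial arithmetic. The following Python sketch is a hedged illustration restricted to the $(2 \times 2)$ case: scalar lag polynomials are stored as coefficient lists `[c0, c1, ...]` (our own convention), the helper names are hypothetical, and the operator $A(L) = \begin{bmatrix} 1 & -\alpha L \\ 0 & 1 \end{bmatrix}$ with the arbitrary value $\alpha = 2$ serves as input.

```python
def poly_mul(p, q):
    """Product of two scalar lag polynomials given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_sub(p, q):
    """Difference p - q of two scalar lag polynomials."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a - b for a, b in zip(p, q)]


def trim(p):
    """Drop trailing zero coefficients, keeping at least the constant term."""
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p


def det_and_adjoint_2x2(op):
    """Determinant and adjoint of a (2 x 2) matrix of lag polynomials."""
    (a, b), (c, d) = op
    det = trim(poly_sub(poly_mul(a, d), poly_mul(b, c)))
    adj = [[d, [-x for x in b]], [[-x for x in c], a]]
    return det, adj


alpha = 2  # illustrative value
var_operator = [[[1], [0, -alpha]], [[0], [1]]]  # [1, -alpha L; 0, 1]
det, adj = det_and_adjoint_2x2(var_operator)
```

The determinant comes out as the constant 1 and the adjoint as $\begin{bmatrix} 1 & \alpha L \\ 0 & 1 \end{bmatrix}$, so premultiplying by the adjoint leaves a scalar (here trivial) VAR operator.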

To  find  the  echelon  form  corresponding  to  a  given  VARMA  model,  we  have 
to  cancel  as  much  as  possible  so  as  to  make  the  VAR  and  MA  operators  left- 
coprime.  Then  a  unimodular  matrix  operator  has  to  be  determined  which, 
upon  premultiplication,  transforms  the  given  model  into  an  echelon  form. 
It  usually  helps  to  determine  the  Kronecker  indices  (row  degrees)  and  the 
corresponding  numbers  pki  first.  We  will  now  consider  examples. 

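Let us begin with the simple bivariate process

$$y_t = \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} y_{t-1} + u_t \qquad (12.1.17)$$

with $\alpha \neq 0$. Noting that

$$|A(L)| = \begin{vmatrix} 1 & -\alpha L \\ 0 & 1 \end{vmatrix} = 1 \quad \text{and} \quad A(L)^{adj} = \begin{bmatrix} 1 & \alpha L \\ 0 & 1 \end{bmatrix},$$

the final equations form is seen to be

$$y_t = u_t + \begin{bmatrix} 0 & \alpha \\ 0 & 0 \end{bmatrix} u_{t-1}. \qquad (12.1.18)$$

To find the echelon representation, we first determine the Kronecker indices
or row degrees and the implied $p_{ki}$ from Definition 12.2. The first row of
(12.1.17) has degree $p_1 = 1$ and the second row has degree $p_2 = 0$. Hence,

$$p_{11} = 1, \quad p_{12} = 0, \quad p_{21} = 1, \quad p_{22} = 0,$$

so that

$$\alpha_{11}(L) = 1 - \alpha_{11,1} L, \quad \alpha_{12}(L) = 0, \quad \alpha_{21}(L) = -\alpha_{21,0}, \quad \text{and} \quad \alpha_{22}(L) = 1.$$

Thus, the echelon form is

$$\begin{bmatrix} 1 - \alpha_{11,1} L & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} y_t = \begin{bmatrix} 1 + m_{11,1} L & m_{12,1} L \\ -\alpha_{21,0} & 1 \end{bmatrix} u_t. \qquad (12.1.19)$$

The unique parameter values in this representation corresponding to the
specific process (12.1.17) are easily seen to be

$$\alpha_{11,1} = \alpha_{21,0} = m_{11,1} = 0 \quad \text{and} \quad m_{12,1} = \alpha.$$

Thus, in this particular case, the final equations form and the echelon form
coincide.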

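As another example, we consider a 3-dimensional process with VARMA(2,1)
representation

$$\begin{bmatrix} 1 - \theta_1 L & -\theta_2 L & 0 \\ 0 & 1 - \theta_3 L - \theta_4 L^2 & -\theta_5 L \\ 0 & 0 & 1 \end{bmatrix} y_t = \begin{bmatrix} 1 - \eta_1 L & 0 & 0 \\ 0 & 1 - \eta_2 L & 0 \\ 0 & 0 & 1 - \eta_3 L \end{bmatrix} u_t. \qquad (12.1.20)$$

Using (12.1.16), its final equations form is seen to be

$$(1 - \theta_1 L)(1 - \theta_3 L - \theta_4 L^2) y_t = \begin{bmatrix} 1 - \theta_3 L - \theta_4 L^2 & \theta_2 L & \theta_2 \theta_5 L^2 \\ 0 & 1 - \theta_1 L & \theta_5 L - \theta_1 \theta_5 L^2 \\ 0 & 0 & (1 - \theta_1 L)(1 - \theta_3 L - \theta_4 L^2) \end{bmatrix}$$

$$\qquad \times \begin{bmatrix} 1 - \eta_1 L & 0 & 0 \\ 0 & 1 - \eta_2 L & 0 \\ 0 & 0 & 1 - \eta_3 L \end{bmatrix} u_t,$$

which is easily recognizable as a VARMA(3,4) structure with scalar VAR
operator.

The Kronecker indices, that is, the row degrees of (12.1.20) are $(p_1, p_2, p_3) = (1, 2, 1)$
and the implied $p_{ki}$ numbers from (12.1.14) are collected in the
following matrix:

$$[p_{ki}]_{k,i=1,2,3} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 2 & 1 \end{bmatrix}.$$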

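Consequently, the VAR operator of the echelon form becomes

$$\begin{bmatrix} 1 - \alpha_{11,1} L & -\alpha_{12,1} L & -\alpha_{13,1} L \\ -\alpha_{21,2} L^2 & 1 - \alpha_{22,1} L - \alpha_{22,2} L^2 & -\alpha_{23,2} L^2 \\ -\alpha_{31,1} L & -\alpha_{32,0} - \alpha_{32,1} L & 1 - \alpha_{33,1} L \end{bmatrix}$$

$$= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -\alpha_{32,0} & 1 \end{bmatrix} - \begin{bmatrix} \alpha_{11,1} & \alpha_{12,1} & \alpha_{13,1} \\ 0 & \alpha_{22,1} & 0 \\ \alpha_{31,1} & \alpha_{32,1} & \alpha_{33,1} \end{bmatrix} L - \begin{bmatrix} 0 & 0 & 0 \\ \alpha_{21,2} & \alpha_{22,2} & \alpha_{23,2} \\ 0 & 0 & 0 \end{bmatrix} L^2. \qquad (12.1.21)$$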

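Hence, in the echelon representation,

$$A_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -\alpha_{32,0} & 1 \end{bmatrix}$$

is different from $I_3$, if $\alpha_{32,0} \neq 0$, and, thus, $M_0 = A_0$ is also not the identity
matrix. The MA operator is

$$\begin{bmatrix} 1 + m_{11,1} L & m_{12,1} L & m_{13,1} L \\ m_{21,1} L + m_{21,2} L^2 & 1 + m_{22,1} L + m_{22,2} L^2 & m_{23,1} L + m_{23,2} L^2 \\ m_{31,1} L & -\alpha_{32,0} + m_{32,1} L & 1 + m_{33,1} L \end{bmatrix}$$

or

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -\alpha_{32,0} & 1 \end{bmatrix} + \begin{bmatrix} m_{11,1} & m_{12,1} & m_{13,1} \\ m_{21,1} & m_{22,1} & m_{23,1} \\ m_{31,1} & m_{32,1} & m_{33,1} \end{bmatrix} L + \begin{bmatrix} 0 & 0 & 0 \\ m_{21,2} & m_{22,2} & m_{23,2} \\ 0 & 0 & 0 \end{bmatrix} L^2. \qquad (12.1.22)$$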

The reader may be puzzled by the fact that the last element in the second
row of (12.1.21) does not involve a term with first power of $L$ while such a
term appears in (12.1.20). This model form shows that there is a VARMA
representation equivalent to (12.1.20) with the second but not the first power
of $L$ in the last operator in the second row of $A(L)$. The fact that there always
exists an equivalent echelon representation does not mean that there is always
an immediately obvious relation between the coefficients of any given VARMA
representation and its equivalent echelon form. However, in the present case
it is fairly easy to relate the representations (12.1.20) and (12.1.21)/(12.1.22).
Premultiplying (12.1.20) by the operator

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & \theta_5 L \\ 0 & 0 & 1 \end{bmatrix} \qquad (12.1.23)$$

results in a VAR operator

$$\begin{bmatrix} 1 - \theta_1 L & -\theta_2 L & 0 \\ 0 & 1 - \theta_3 L - \theta_4 L^2 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and the MA operator changes accordingly. Notice that the operator (12.1.23)
has constant determinant and, of course, the resulting VARMA model is
equivalent to (12.1.20). The relation between its coefficients and those of the echelon
representation (12.1.21)/(12.1.22) is obvious:

$$\alpha_{11,1} = \theta_1, \quad \alpha_{12,1} = \theta_2, \quad \alpha_{13,1} = 0,$$

$$\alpha_{21,2} = 0, \quad \alpha_{22,1} = \theta_3, \quad \alpha_{22,2} = \theta_4, \quad \alpha_{23,2} = 0,$$

$$\alpha_{31,1} = \alpha_{32,0} = \alpha_{32,1} = \alpha_{33,1} = 0,$$

and the relation between (12.1.22) and the coefficients of (12.1.20) is also
apparent. Of course, if the zero coefficients are known, then this knowledge
may be used to reduce the number of free coefficients in the echelon form.

In this example, the unrestricted final equations form has 3 AR coefficients
and 36 MA coefficients. Thus, the unrestricted form contains 39 parameters,
apart from white noise covariance coefficients. In contrast, the unrestricted
echelon form (12.1.21)/(12.1.22) has only 23 free parameters and is therefore
preferable in terms of parameter parsimony. Note that, in practice, the true
coefficient values are unknown and we pick an identified structure, for
example, a final equations form or an echelon form. At that stage, further parameter
restrictions may not be available. Hence, if (12.1.20) is the actual data
generation process we may pick a VARMA(3,4) model with scalar AR operator if
we decide to go with a final equations representation and we may choose the
model (12.1.21)/(12.1.22) if we decide to use an echelon form representation.
Obviously, the latter choice results in a more parsimonious parameterization.
As mentioned earlier, for estimation purposes the more parsimonious
representation is advantageous.
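The parameter counts quoted above can be reproduced with a short Python sketch. It is an illustration under our reading of Definitions 12.1 and 12.2 (function names are ours): in the echelon form, the VAR part contributes $\sum_{k,i} p_{ki}$ free coefficients, with $p_{kk} = p_k$, and the MA part contributes $p_k$ coefficients per operator in row $k$ because $M_0 = A_0$ is not free; in the final equations form, the scalar AR operator has $p$ coefficients and the MA part has $qK^2$.

```python
def p_matrix(kronecker):
    """Matrix [p_ki] from (12.1.14); the case k == i is assumed to use min(p_k + 1, p_i)."""
    size = len(kronecker)
    return [
        [
            min(kronecker[k] + 1, kronecker[i]) if k >= i
            else min(kronecker[k], kronecker[i])
            for i in range(size)
        ]
        for k in range(size)
    ]


def echelon_free_parameters(kronecker):
    """Free VAR plus MA coefficients of an unrestricted echelon form."""
    var_count = sum(sum(row) for row in p_matrix(kronecker))
    ma_count = len(kronecker) * sum(kronecker)  # lags 1..p_k in each row
    return var_count + ma_count


def final_equations_free_parameters(dim, p, q):
    """Free coefficients of an unrestricted final equations form."""
    return p + q * dim * dim


echelon_count = echelon_free_parameters([1, 2, 1])          # indices of (12.1.20)
final_count = final_equations_free_parameters(3, p=3, q=4)  # VARMA(3,4), K = 3
```

With Kronecker indices $(1, 2, 1)$ the sketch gives 23 for the echelon form and 39 for the final equations form, in line with the counts stated in the text.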

Although $A_0 \neq I_3$ in the previous example, it should be understood that in
many echelon representations $A_0 = M_0 = I_K$. In particular, if the row degrees
$p_1 = \cdots = p_K = p$, all $p_{ki} = p$, $i, k = 1, \dots, K$, and the echelon form is easily
seen to be a standard VARMA($p, p$) model with $A_0 = M_0 = I_K$. We are
now ready to turn to the actual estimation of the parameters of an identified
VARMA model and we shall discuss its Gaussian likelihood function next.


12.2  The  Gaussian  Likelihood  Function

For maximum likelihood (ML) estimation the likelihood function is needed.
We will now derive useful approximations to the likelihood function of a
Gaussian VARMA($p, q$) process. Special case MA processes will be considered first.


12.2.1  The  Likelihood  Function  of  an  MA(1)  Process 

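Because a zero mean MA(1) process is the simplest member of the finite order
MA family, we use that as a starting point. Hence, we assume that we have a
sample $y_1, \dots, y_T$ which is generated by the Gaussian, $K$-dimensional, invertible
MA(1) process

$$y_t = u_t + M_1 u_{t-1}, \qquad (12.2.1)$$

where $u_t$ is a Gaussian white noise process with covariance matrix $\Sigma_u$. Thus,

$$\mathbf{y} := \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix} = \widetilde{\mathfrak{M}}_1 \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_T \end{bmatrix},$$

where

$$\widetilde{\mathfrak{M}}_1 := \begin{bmatrix} M_1 & I_K & 0 & \cdots & 0 & 0 \\ 0 & M_1 & I_K & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & M_1 & I_K \end{bmatrix} \qquad (12.2.2)$$

is a $(KT \times K(T+1))$ matrix. Using that $u_t$ is Gaussian white noise and, thus,

$$\begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_T \end{bmatrix} \sim \mathcal{N}(0, I_{T+1} \otimes \Sigma_u),$$

it follows that

$$\mathbf{y} \sim \mathcal{N}(0, \widetilde{\mathfrak{M}}_1 (I_{T+1} \otimes \Sigma_u) \widetilde{\mathfrak{M}}_1')$$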

and  the  likelihood  function  is  seen  to  be 


i(Mi,su  |y) 

oc  \Wli(lT+i  ®  Zu)m[\-1/2  exp{-iy'[ajl1(/T+i  ®  T'u)9Jli]“1y}1 


(12.2.3) 
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where  oc  stands  for  “is  proportional  to”.  In  other  words,  we  have  dropped  a 
multiplicative  constant  from  the  likelihood  function  which  does  not  change 
the  maximizing  values  of  Mi  and  Su. 

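It is inconvenient that this function involves the determinant and the
inverse of a $(KT \times KT)$ matrix. A simpler form is obtained if $u_0$ is set
to zero, that is, the MA(1) process is assumed to be started up with a
nonrandom fixed vector $u_0 = 0$. In that case,

\[ \mathbf{y} = \mathfrak{M}_1 \mathbf{u}, \]

where $\mathfrak{M}_1$ now denotes the square matrix

\[ \mathfrak{M}_1 := \begin{bmatrix}
   I_K & 0 & \cdots & 0 & 0 \\
   M_1 & I_K & & 0 & 0 \\
   \vdots & \ddots & \ddots & & \vdots \\
   0 & 0 & \cdots & M_1 & I_K
   \end{bmatrix} \quad (KT \times KT)
   \qquad\text{and}\qquad
   \mathbf{u} := \begin{bmatrix} u_1 \\ \vdots \\ u_T \end{bmatrix} \quad (KT \times 1).
   \tag{12.2.4} \]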


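The likelihood function is then proportional to

\[ \begin{aligned}
   l_0(M_1, \Sigma_u \mid \mathbf{y})
   &= \bigl|\mathfrak{M}_1 (I_T \otimes \Sigma_u) \mathfrak{M}_1'\bigr|^{-1/2}
      \exp\Bigl\{ -\tfrac{1}{2}\, \mathbf{y}' \bigl[\mathfrak{M}_1 (I_T \otimes \Sigma_u) \mathfrak{M}_1'\bigr]^{-1} \mathbf{y} \Bigr\} \\
   &= |\Sigma_u|^{-T/2}
      \exp\Bigl\{ -\tfrac{1}{2}\, \mathbf{y}' \mathfrak{M}_1'^{-1} (I_T \otimes \Sigma_u^{-1}) \mathfrak{M}_1^{-1} \mathbf{y} \Bigr\} \\
   &= |\Sigma_u|^{-T/2}
      \exp\Bigl\{ -\tfrac{1}{2} \sum_{t=1}^{T} u_t' \Sigma_u^{-1} u_t \Bigr\},
   \end{aligned} \tag{12.2.5} \]

where it has been used that $|\mathfrak{M}_1| = 1$ and

\[ \mathfrak{M}_1^{-1} = \begin{bmatrix}
   I_K & 0 & \cdots & 0 & 0 \\
   -M_1 & I_K & & 0 & 0 \\
   (-M_1)^2 & -M_1 & \ddots & 0 & 0 \\
   \vdots & \vdots & & \ddots & \vdots \\
   (-M_1)^{T-1} & (-M_1)^{T-2} & \cdots & -M_1 & I_K
   \end{bmatrix}
   = \begin{bmatrix}
   I_K & 0 & \cdots & 0 \\
   -\Pi_1 & I_K & & 0 \\
   \vdots & \vdots & \ddots & \vdots \\
   -\Pi_{T-1} & -\Pi_{T-2} & \cdots & I_K
   \end{bmatrix}, \]

where the $\Pi_i = -(-M_1)^i$ are the coefficients of the pure VAR representation
of the process. By successive substitution, the MA(1) process in (12.2.1) can
be rewritten as

\[ y_t + \sum_{i=1}^{t-1} (-M_1)^i y_{t-i} + (-M_1)^t u_0 = u_t. \tag{12.2.6} \]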
Thus, if $u_0 = 0$,

\[ u_t = y_t + \sum_{i=1}^{t-1} (-M_1)^i y_{t-i}, \]

from which the last expression in (12.2.5) is obtained.

The equation (12.2.6) also shows that, for large $t$, the assumption regarding
$u_0$ becomes inconsequential because, for an invertible process, $M_1^t$ approaches
zero as $t \to \infty$. The impact of $u_0$ disappears more rapidly for processes for
which $M_1^t$ goes to zero more rapidly as $t$ gets large. In other words, if all
eigenvalues of $M_1$ are close to zero or, equivalently, all roots of
$\det(I_K + M_1 z)$ are far outside the unit circle, then the impact of $u_0$ is
lower than for processes with roots close to the unit circle. In summary, the
likelihood approximation in (12.2.5) will improve as the sample size gets large
and will become exact as $T \to \infty$. In small samples, it is better for
processes with roots of $\det(I_K + M_1 z)$ far away from the unit circle than
for those with roots close to the noninvertibility region. Because we will be
concerned predominantly with large sample properties in the following, we will
often work with likelihood approximations such as $l_0$ in (12.2.5).
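
The equivalence of the matrix form and the residual form of (12.2.5) can be checked numerically. The following is only a minimal sketch under stated assumptions: the function names, the parameter values, and the use of random numbers in place of real data are illustrative choices, not part of the text.

```python
import numpy as np

def ma1_residuals(y, M1):
    """u_t = y_t - M1 u_{t-1}, started with u_0 = 0 (rows of y are y_1, ..., y_T)."""
    T, K = y.shape
    u = np.zeros((T, K))
    u_prev = np.zeros(K)                      # start-up assumption u_0 = 0
    for t in range(T):
        u[t] = y[t] - M1 @ u_prev
        u_prev = u[t]
    return u

def log_l0_recursive(y, M1, Sigma_u):
    """Logarithm of the last expression in (12.2.5)."""
    T = y.shape[0]
    u = ma1_residuals(y, M1)
    quad = np.einsum("ti,ij,tj->", u, np.linalg.inv(Sigma_u), u)
    return -0.5 * T * np.linalg.slogdet(Sigma_u)[1] - 0.5 * quad

def log_l0_matrix(y, M1, Sigma_u):
    """Logarithm of the first expression in (12.2.5), built from the (KT x KT) matrix."""
    T, K = y.shape
    frak_M = np.eye(K * T)
    for t in range(1, T):                     # M1 on the first block subdiagonal
        frak_M[t * K:(t + 1) * K, (t - 1) * K:t * K] = M1
    cov = frak_M @ np.kron(np.eye(T), Sigma_u) @ frak_M.T
    y_vec = y.reshape(-1)                     # stacks y_1, ..., y_T
    return -0.5 * np.linalg.slogdet(cov)[1] - 0.5 * y_vec @ np.linalg.solve(cov, y_vec)

# Hypothetical illustration values.
rng = np.random.default_rng(0)
M1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Sigma_u = np.array([[1.0, 0.2], [0.2, 0.5]])
y = rng.standard_normal((20, 2))
value_recursive = log_l0_recursive(y, M1, Sigma_u)
value_matrix = log_l0_matrix(y, M1, Sigma_u)
```

The recursive version never forms a $(KT \times KT)$ matrix, which is the practical reason for preferring the last expression in (12.2.5).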


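12.2.2 The MA(q) Case

A similar reasoning as for MA(1) processes can also be employed for higher
order MA processes. Suppose the generation process of $y_t$ has a zero mean
MA($q$) representation

\[ y_t = u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}. \]

Then

\[ \mathbf{y} = \mathfrak{M}_q \begin{bmatrix} u_{-q+1} \\ \vdots \\ u_0 \\ u_1 \\ \vdots \\ u_T \end{bmatrix},
   \tag{12.2.7} \]

where

\[ \mathfrak{M}_q := \begin{bmatrix}
   M_q & M_{q-1} & \cdots & M_1 & I_K & 0 & \cdots & 0 \\
   0 & M_q & \cdots & M_2 & M_1 & I_K & \cdots & 0 \\
   \vdots & & \ddots & & & \ddots & \ddots & \vdots \\
   0 & 0 & \cdots & M_q & \cdots & M_2 & M_1 & I_K
   \end{bmatrix} \tag{12.2.8} \]

is a $(KT \times K(T+q))$ matrix and the exact likelihood for a sample of size $T$
is seen to be

\[ l(M_1, \dots, M_q, \Sigma_u \mid \mathbf{y}) \propto
   \bigl|\mathfrak{M}_q (I_{T+q} \otimes \Sigma_u) \mathfrak{M}_q'\bigr|^{-1/2}
   \exp\Bigl\{ -\tfrac{1}{2}\, \mathbf{y}' \bigl[\mathfrak{M}_q (I_{T+q} \otimes \Sigma_u) \mathfrak{M}_q'\bigr]^{-1} \mathbf{y} \Bigr\}.
   \tag{12.2.9} \]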

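Again a convenient approximation to the likelihood function is obtained
by setting $u_{-q+1} = \cdots = u_0 = 0$. In that case, the likelihood is, apart
from a multiplicative constant,

\[ l_0(M_1, \dots, M_q, \Sigma_u \mid \mathbf{y}) =
   |\Sigma_u|^{-T/2}
   \exp\Bigl\{ -\tfrac{1}{2}\, \mathbf{y}' \bigl[\mathfrak{M}_q'^{-1} (I_T \otimes \Sigma_u^{-1}) \mathfrak{M}_q^{-1}\bigr] \mathbf{y} \Bigr\},
   \tag{12.2.10} \]

where $\mathfrak{M}_q$ now denotes the $(KT \times KT)$ matrix

\[ \mathfrak{M}_q := \begin{bmatrix}
   I_K & 0 & \cdots & & & 0 \\
   M_1 & I_K & & & & 0 \\
   M_2 & M_1 & \ddots & & & \vdots \\
   \vdots & \vdots & & \ddots & & \\
   M_q & M_{q-1} & & & & \\
   0 & M_q & & & \ddots & \\
   \vdots & & \ddots & & & 0 \\
   0 & 0 & \cdots & M_q \;\cdots & M_1 & I_K
   \end{bmatrix} \]

and, hence,

\[ \mathfrak{M}_q^{-1} = \begin{bmatrix}
   I_K & 0 & \cdots & 0 \\
   -\Pi_1 & I_K & & 0 \\
   \vdots & \vdots & \ddots & \vdots \\
   -\Pi_{T-1} & -\Pi_{T-2} & \cdots & I_K
   \end{bmatrix}. \tag{12.2.11} \]

Here the $\Pi_i$ are the coefficient matrices of the pure VAR representation of
the process $y_t$. Thus, the $\Pi_i$ can be computed recursively as in Section 11.2
of Chapter 11.

An alternative expression for the approximate likelihood is easily seen to
be

\[ l_0(M_1, \dots, M_q, \Sigma_u \mid \mathbf{y}) =
   |\Sigma_u|^{-T/2} \exp\Bigl\{ -\tfrac{1}{2} \sum_{t=1}^{T} u_t' \Sigma_u^{-1} u_t \Bigr\},
   \tag{12.2.12} \]

where

\[ u_t = y_t - \sum_{i=1}^{t-1} \Pi_i y_{t-i}. \]

Again, the likelihood approximation will be quite precise if $T$ is reasonably
large and the roots of $\det(I_K + M_1 z + \cdots + M_q z^q)$ are not close to the
unit circle.

Although we will work with likelihood approximations in the following, it is
perhaps worth noting that an expression for the exact likelihood of an MA($q$)
process can be derived that is more manageable than the one in (12.2.9) (see,
e.g., Hillmer & Tiao (1979), Kohn (1981)).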
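
The block structure in (12.2.11) can be illustrated with a short numerical sketch. It assumes the recursion $\Pi_i = M_i - \sum_{j=1}^{i-1} M_j \Pi_{i-j}$ (with $M_i = 0$ for $i > q$), which follows from multiplying out $(I_K + M_1 L + \cdots + M_q L^q)(I_K - \Pi_1 L - \Pi_2 L^2 - \cdots) = I_K$; the function names and the coefficient values are hypothetical.

```python
import numpy as np

def pure_var_coeffs_ma(M, n):
    """Pi_1, ..., Pi_n of the pure VAR representation of an MA(q) process.

    M is the list [M_1, ..., M_q]; the recursion used is
    Pi_i = M_i - sum_{j=1}^{i-1} M_j Pi_{i-j}, with M_i = 0 for i > q.
    """
    q, K = len(M), M[0].shape[0]
    Pi = []
    for i in range(1, n + 1):
        P = M[i - 1].copy() if i <= q else np.zeros((K, K))
        for j in range(1, min(i - 1, q) + 1):
            P = P - M[j - 1] @ Pi[i - j - 1]   # Pi[k - 1] holds Pi_k
        Pi.append(P)
    return Pi

def square_frak_M(M, T):
    """The (KT x KT) lower block-triangular matrix used in (12.2.10)."""
    q, K = len(M), M[0].shape[0]
    out = np.eye(K * T)
    for s in range(T):
        for j in range(1, min(s, q) + 1):
            out[s * K:(s + 1) * K, (s - j) * K:(s - j + 1) * K] = M[j - 1]
    return out

# Hypothetical MA(2) coefficients, chosen only for illustration.
M = [np.array([[0.5, 0.1], [0.0, 0.3]]), np.array([[0.2, 0.0], [0.1, 0.1]])]
T, K = 6, 2
frak_M_inv = np.linalg.inv(square_frak_M(M, T))
Pi = pure_var_coeffs_ma(M, T - 1)
```

Block $(s, t)$ of `frak_M_inv` below the diagonal should then equal $-\Pi_{s-t}$, so the inverse never has to be formed explicitly in practice.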

12.2.3  The  VARMA(1, 1)  Case 

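Before we tackle general mixed VARMA models, we shall consider the simplest
candidate, namely a Gaussian zero mean, stationary, stable, and invertible
VARMA(1,1) process,

\[ y_t = A_1 y_{t-1} + u_t + M_1 u_{t-1}. \tag{12.2.13} \]

Assuming that we have a sample $y_1, \dots, y_T$, generated by this process and
defining

\[ \mathfrak{A}_p := \begin{bmatrix}
   I_K & 0 & \cdots & & 0 & 0 \\
   -A_1 & I_K & & & 0 & 0 \\
   -A_2 & -A_1 & \ddots & & 0 & 0 \\
   \vdots & \vdots & & & \vdots & \vdots \\
   -A_p & -A_{p-1} & & & 0 & 0 \\
   0 & -A_p & & & 0 & 0 \\
   \vdots & & \ddots & \ddots & \vdots & \vdots \\
   0 & 0 & & \ddots & I_K & 0 \\
   0 & 0 & \cdots & -A_p \;\cdots & -A_1 & I_K
   \end{bmatrix}, \tag{12.2.14} \]

we get

\[ \mathfrak{A}_1 \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix}
   + \begin{bmatrix} -A_1 y_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
   = \mathfrak{M}_1 \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_T \end{bmatrix}, \]

where $\mathfrak{M}_1$ is the $(KT \times K(T+1))$ matrix from (12.2.2). Hence, for
given, fixed presample values $y_0$,

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix}
   \sim \mathcal{N}\bigl( \mathfrak{A}_1^{-1} \mathbf{y}_0,\;
   \mathfrak{A}_1^{-1} \mathfrak{M}_1 (I_{T+1} \otimes \Sigma_u) \mathfrak{M}_1' \mathfrak{A}_1'^{-1} \bigr),
   \tag{12.2.15} \]

where

\[ \mathbf{y}_0 := \begin{bmatrix} A_1 y_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \]

The corresponding likelihood function, conditional on $\mathbf{y}_0$, is

\[ \begin{aligned}
   l(A_1, M_1, \Sigma_u \mid \mathbf{y}, \mathbf{y}_0)
   &\propto \bigl|\mathfrak{A}_1^{-1} \mathfrak{M}_1 (I_{T+1} \otimes \Sigma_u) \mathfrak{M}_1' \mathfrak{A}_1'^{-1}\bigr|^{-1/2} \\
   &\quad \times \exp\Bigl\{ -\tfrac{1}{2} (\mathbf{y} - \mathfrak{A}_1^{-1} \mathbf{y}_0)'
      \bigl[\mathfrak{A}_1^{-1} \mathfrak{M}_1 (I_{T+1} \otimes \Sigma_u) \mathfrak{M}_1' \mathfrak{A}_1'^{-1}\bigr]^{-1}
      (\mathbf{y} - \mathfrak{A}_1^{-1} \mathbf{y}_0) \Bigr\} \\
   &= \bigl|\mathfrak{M}_1 (I_{T+1} \otimes \Sigma_u) \mathfrak{M}_1'\bigr|^{-1/2} \\
   &\quad \times \exp\Bigl\{ -\tfrac{1}{2} (\mathfrak{A}_1 \mathbf{y} - \mathbf{y}_0)'
      \bigl[\mathfrak{M}_1 (I_{T+1} \otimes \Sigma_u) \mathfrak{M}_1'\bigr]^{-1}
      (\mathfrak{A}_1 \mathbf{y} - \mathbf{y}_0) \Bigr\},
   \end{aligned} \tag{12.2.16} \]

where $|\mathfrak{A}_1| = 1$ has been used.

With the same arguments as in the pure MA case, a simple approximation
is obtained by setting $u_0 = y_0 = 0$. Then we get, with the $(KT \times KT)$
matrix $\mathfrak{M}_1$ from (12.2.4),

\[ \begin{aligned}
   l_0(A_1, M_1, \Sigma_u)
   &= |\Sigma_u|^{-T/2} \exp\Bigl\{ -\tfrac{1}{2} (\mathfrak{M}_1^{-1} \mathfrak{A}_1 \mathbf{y})'
      (I_T \otimes \Sigma_u^{-1}) \mathfrak{M}_1^{-1} \mathfrak{A}_1 \mathbf{y} \Bigr\} \\
   &= |\Sigma_u|^{-T/2} \exp\Bigl\{ -\tfrac{1}{2} \sum_{t=1}^{T} u_t' \Sigma_u^{-1} u_t \Bigr\},
   \end{aligned} \tag{12.2.17} \]

where

\[ u_t = y_t - \sum_{i=1}^{t-1} \Pi_i y_{t-i} \tag{12.2.18} \]

and the $\Pi_i$ are the coefficient matrices of the pure VAR representation, that
is, for the present case $\Pi_i = (-1)^{i-1} (M_1^i + M_1^{i-1} A_1)$, $i = 1, 2, \dots$
(see Section 11.3.1). Note that in writing the likelihood approximation $l_0$ we
have dropped the conditions $\mathbf{y}$ and $\mathbf{y}_0$ for notational simplicity.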

The effect of starting up the process with $y_0 = u_0 = 0$ is quite easily
seen in (12.2.18), namely, for observation $y_t$, the infinite order pure VAR
representation is truncated at lag $t - 1$. Such a truncation has little effect if
the sample size is large and the roots of the MA operator are not close to the
unit circle.
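
As a small check of (12.2.18), the sketch below compares residuals from the direct VARMA(1,1) recursion started at $y_0 = u_0 = 0$ with residuals from the truncated pure VAR form that uses the closed form $\Pi_i = (-1)^{i-1}(M_1^i + M_1^{i-1}A_1)$. The helper name `pi_varma11`, the coefficient values, and the random input series are assumptions made only for illustration.

```python
import numpy as np

def pi_varma11(A1, M1, i):
    """Pi_i = (-1)^(i-1) (M1^i + M1^(i-1) A1) for a VARMA(1,1) process."""
    power = np.linalg.matrix_power
    return (-1) ** (i - 1) * (power(M1, i) + power(M1, i - 1) @ A1)

# Hypothetical coefficient matrices and an arbitrary input series.
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
M1 = np.array([[0.4, 0.0], [0.2, 0.3]])
rng = np.random.default_rng(3)
y = rng.standard_normal((15, 2))          # rows are y_1, ..., y_T
T, K = y.shape

# Direct recursion u_t = y_t - A1 y_{t-1} - M1 u_{t-1} with y_0 = u_0 = 0.
u_rec = np.zeros((T, K))
for t in range(T):
    u_rec[t] = y[t]
    if t > 0:
        u_rec[t] -= A1 @ y[t - 1] + M1 @ u_rec[t - 1]

# Truncated pure VAR form (12.2.18): row index t corresponds to time t + 1,
# so the sum runs over the t available lags.
u_var = np.zeros((T, K))
for t in range(T):
    u_var[t] = y[t] - sum(
        (pi_varma11(A1, M1, i) @ y[t - i] for i in range(1, t + 1)),
        np.zeros(K),
    )
```

Both arrays should agree up to rounding error, which is the sense in which the start-up assumption amounts to truncating the VAR representation at lag $t - 1$.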


12.2.4  The  General  VARMA(p,  q)  Case 

Now suppose a sample $y_1, \dots, y_T$ is generated by the Gaussian
$K$-dimensional, stable, invertible VARMA$(p, q)$ process

\[ A_0 (y_t - \mu) = A_1 (y_{t-1} - \mu) + \cdots + A_p (y_{t-p} - \mu)
   + A_0 u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q} \tag{12.2.19} \]

with mean vector $\mu$ and nonsingular white noise covariance matrix $\Sigma_u$.
Notice that $A_0$ appears as the coefficient matrix of $y_t$ and of $u_t$ as in the
echelon form. Thus, the echelon form is covered by our treatment of the general
VARMA$(p, q)$ case. We have chosen the mean-adjusted form of the process
because this form has certain advantages in ML estimation, as we will see
later.

Usually some elements of the coefficient matrices will be zero or obey some
other type of restrictions. Therefore, to be realistic, we define

\[ \alpha_0 := \operatorname{vec}(A_0) \quad\text{and}\quad
   \beta := \operatorname{vec}[A_1, \dots, A_p, M_1, \dots, M_q] \tag{12.2.20} \]

and assume that these coefficients are linearly related to an $(N \times 1)$
parameter vector $\gamma$, that is,

\[ \begin{bmatrix} \alpha_0 \\ \beta \end{bmatrix} = R\gamma + r \tag{12.2.21} \]

for a suitable, known $(K^2(p+q+1) \times N)$ matrix $R$ and a known
$K^2(p+q+1)$-vector $r$. For example, for a bivariate ARMA$_E(1,0)$ process with
Kronecker indices $p_1 = 1$ and $p_2 = 0$,

\[ \begin{bmatrix} 1 - \alpha_{11,1} L & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} (y_t - \mu)
   = \begin{bmatrix} 1 + m_{11,1} L & m_{12,1} L \\ -\alpha_{21,0} & 1 \end{bmatrix} u_t \]

or

\[ \begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} (y_t - \mu)
   = \begin{bmatrix} \alpha_{11,1} & 0 \\ 0 & 0 \end{bmatrix} (y_{t-1} - \mu)
   + \begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} u_t
   + \begin{bmatrix} m_{11,1} & m_{12,1} \\ 0 & 0 \end{bmatrix} u_{t-1}, \]

we get

\[ \alpha_0 = \begin{bmatrix} 1 \\ -\alpha_{21,0} \\ 0 \\ 1 \end{bmatrix}, \quad
   \beta = \begin{bmatrix} \alpha_{11,1} \\ 0 \\ 0 \\ 0 \\ m_{11,1} \\ 0 \\ m_{12,1} \\ 0 \end{bmatrix}, \quad
   R = \begin{bmatrix}
   0 & 0 & 0 & 0 \\
   -1 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 1 & 0 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 0 & 1 & 0 \\
   0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 1 \\
   0 & 0 & 0 & 0
   \end{bmatrix}, \quad
   \gamma = \begin{bmatrix} \alpha_{21,0} \\ \alpha_{11,1} \\ m_{11,1} \\ m_{12,1} \end{bmatrix}, \quad\text{and}\quad
   r = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \]

Similarly, for the final equations form

\[ (1 - \alpha_1 L)(y_t - \mu)
   = \begin{bmatrix} 1 + m_{11} L & m_{12} L \\ m_{21} L & 1 + m_{22} L \end{bmatrix} u_t \]

or

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} (y_t - \mu)
   = \begin{bmatrix} \alpha_1 & 0 \\ 0 & \alpha_1 \end{bmatrix} (y_{t-1} - \mu)
   + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} u_t
   + \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} u_{t-1}, \]

we get

\[ \alpha_0 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}, \quad
   \beta = \begin{bmatrix} \alpha_1 \\ 0 \\ 0 \\ \alpha_1 \\ m_{11} \\ m_{21} \\ m_{12} \\ m_{22} \end{bmatrix}, \quad
   R = \begin{bmatrix}
   0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 \\
   1 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 \\
   1 & 0 & 0 & 0 & 0 \\
   0 & 1 & 0 & 0 & 0 \\
   0 & 0 & 1 & 0 & 0 \\
   0 & 0 & 0 & 1 & 0 \\
   0 & 0 & 0 & 0 & 1
   \end{bmatrix}, \quad
   \gamma = \begin{bmatrix} \alpha_1 \\ m_{11} \\ m_{21} \\ m_{12} \\ m_{22} \end{bmatrix}, \quad\text{and}\quad
   r = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \]
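
The restriction mapping (12.2.21) for the echelon form example can be verified with a few lines of code. This is a sketch only: the numerical values in `gamma` are arbitrary, and the helper `vec` is an illustrative name for column stacking.

```python
import numpy as np

def vec(A):
    """Column-stacking vec operator."""
    return np.asarray(A, dtype=float).reshape(-1, order="F")

# R and r for the bivariate echelon form with Kronecker indices (1, 0);
# gamma = (alpha_{21,0}, alpha_{11,1}, m_{11,1}, m_{12,1})'.
R = np.zeros((12, 4))
R[1, 0] = -1.0    # element (2,1) of A_0 equals -alpha_{21,0}
R[4, 1] = 1.0     # element (1,1) of A_1 equals alpha_{11,1}
R[8, 2] = 1.0     # element (1,1) of M_1 equals m_{11,1}
R[10, 3] = 1.0    # element (1,2) of M_1 equals m_{12,1}
r = np.zeros(12)
r[0] = r[3] = 1.0  # unit diagonal of A_0

# Arbitrary parameter values, used only to test the mapping.
gamma = np.array([0.4, 0.7, 0.3, -0.2])
a210, a111, m111, m121 = gamma
A0 = [[1.0, 0.0], [-a210, 1.0]]
A1 = [[a111, 0.0], [0.0, 0.0]]
M1 = [[m111, m121], [0.0, 0.0]]
stacked = np.concatenate([vec(A0), vec(np.hstack([A1, M1]))])
mapped = R @ gamma + r
```

Here `mapped` should reproduce `stacked`, that is, $(\alpha_0', \beta')'$ for the chosen parameter values.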
The likelihood function is a function of $\mu$, $\gamma$, and $\Sigma_u$. Its exact
form, given fixed initial values $y_{-p+1}, \dots, y_0$, can be derived analogously
to the previously considered special cases (see Problem 12.4 and Hillmer & Tiao
(1979)). Here we will just give the likelihood approximation obtained by assuming

\[ y_{-p+1} - \mu = \cdots = y_0 - \mu = u_{-q+1} = \cdots = u_0 = 0. \]

Apart from a multiplicative constant, we get

\[ l_0(\mu, \gamma, \Sigma_u) = |\Sigma_u|^{-T/2}
   \exp\Bigl\{ -\tfrac{1}{2} \sum_{t=1}^{T} u_t(\mu, \gamma)' \Sigma_u^{-1} u_t(\mu, \gamma) \Bigr\},
   \tag{12.2.22} \]

where

\[ u_t(\mu, \gamma) = (y_t - \mu) - \sum_{i=1}^{t-1} \Pi_i(\gamma) (y_{t-i} - \mu),
   \tag{12.2.23} \]

with the $\Pi_i(\gamma)$'s being again the coefficient matrices of the pure VAR
representation of $y_t$. We have indicated that these matrices are determined by
the parameter vector $\gamma$. Formally the likelihood approximation has the same
appearance as in the special cases. Of course, the $u_t$'s are now potentially
more complicated functions of the parameters.

It  is  perhaps  worth  noting  that  the  uniqueness  or  identification  problem 
discussed  in  Section  12.1  is  reflected  in  the  likelihood  function.  If  the  model  is 
parameterized  in  a  unique  way,  for  instance,  in  final  equations  form  or  echelon 
form,  the  likelihood  function  has  a  locally  unique  maximum.  This  property 
is  of  obvious  importance  to  guarantee  unique  ML  estimators.  Note,  however, 
that  the  likelihood  function  in  general  has  more  than  one  local  maximum.  A 
more  detailed  discussion  of  the  properties  of  the  likelihood  function  can  be 
found in Deistler & Pötscher (1984).

The next section focuses on the maximization of the approximate likelihood
function (12.2.22) or, equivalently, the maximization of its logarithm,

\[ \ln l_0(\mu, \gamma, \Sigma_u) = -\frac{T}{2} \ln|\Sigma_u|
   - \frac{1}{2} \sum_{t=1}^{T} u_t' \Sigma_u^{-1} u_t. \tag{12.2.24} \]
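
A minimal sketch of how (12.2.24) might be evaluated is given below. It assumes that the residuals are obtained from the recursion implied by (12.2.19) with all presample quantities set to zero; the function names and the simulated example are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def varma_residuals(y, mu, A0, A, M):
    """Residuals u_t(mu, gamma) with y_s - mu = 0 and u_s = 0 for s <= 0.

    Uses u_t = (y_t - mu) - A0^{-1} [sum_i A_i (y_{t-i} - mu) + sum_j M_j u_{t-j}].
    A and M are lists of (K x K) arrays; rows of y are y_1, ..., y_T.
    """
    T, K = y.shape
    yc = y - mu
    A0_inv = np.linalg.inv(A0)
    u = np.zeros((T, K))
    for t in range(T):
        acc = np.zeros(K)
        for i, A_i in enumerate(A, start=1):
            if t - i >= 0:
                acc += A_i @ yc[t - i]
        for j, M_j in enumerate(M, start=1):
            if t - j >= 0:
                acc += M_j @ u[t - j]
        u[t] = yc[t] - A0_inv @ acc
    return u

def log_l0(y, mu, A0, A, M, Sigma_u):
    """Approximate log-likelihood (12.2.24), up to an additive constant."""
    T = y.shape[0]
    u = varma_residuals(y, mu, A0, A, M)
    quad = np.trace(np.linalg.solve(Sigma_u, u.T @ u))   # sum_t u_t' Sigma_u^{-1} u_t
    return -0.5 * T * np.linalg.slogdet(Sigma_u)[1] - 0.5 * quad

# Hypothetical echelon-type example, simulated with zero presample values.
rng = np.random.default_rng(7)
mu = np.array([1.0, 2.0])
A0 = np.array([[1.0, 0.0], [-0.4, 1.0]])
A1 = np.array([[0.7, 0.0], [0.0, 0.0]])
M1 = np.array([[0.3, -0.2], [0.0, 0.0]])
T = 50
u_true = rng.standard_normal((T, 2))
y = np.zeros((T, 2))
A0_inv = np.linalg.inv(A0)
for t in range(T):
    lagged = A1 @ (y[t - 1] - mu) + M1 @ u_true[t - 1] if t > 0 else np.zeros(2)
    y[t] = mu + A0_inv @ lagged + u_true[t]
u_hat = varma_residuals(y, mu, A0, [A1], [M1])
value = log_l0(y, mu, A0, [A1], [M1], np.eye(2))
```

Because the data are simulated with the same start-up assumption, the recursion recovers the innovations exactly in this example; with real data the first few residuals are affected by the unknown presample values.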


12.3  Computation  of  the  ML  Estimates 

In  the  pure  finite  order  VAR  case  considered  in  Chapters  3  and  5,  we  have 
obtained  the  ML  estimates  by  solving  the  normal  equations.  In  the  presently 
considered  VARMA(p,  q)  case,  we  may  use  the  same  principle.  In  other  words, 
we  determine  the  first  order  partial  derivatives  of  the  log-likelihood  function 
or  rather  its  approximation  given  in  (12.2.24)  and  equate  them  to  zero.  We 
will  obtain  the  normal  equations  in  Section  12.3.1.  It  turns  out  that  they 
are  nonlinear  in  the  parameters  and  we  discuss  algorithms  for  solving  the 
ML  optimization  problem  in  Section  12.3.2.  The  optimization  procedures  are 
iterative  algorithms  that  require  starting-up  values  or  preliminary  estimates 
for  the  parameters.  A  possible  choice  of  initial  estimates  is  proposed  in  Section 
12.3.4.  One  of  the  optimization  algorithms  involves  the  information  matrix 
which  is  given  in  Section  12.3.3.  An  example  is  discussed  in  Section  12.3.5. 
12.3.1 The Normal Equations

In order to set up the normal equations corresponding to the approximate
log-likelihood given in (12.2.24), we derive the first order partial derivatives
with respect to all the parameters $\mu$, $\gamma$, and $\Sigma_u$.

\[ \frac{\partial \ln l_0}{\partial \mu'}
   = -\sum_{t=1}^{T} u_t' \Sigma_u^{-1} \frac{\partial u_t}{\partial \mu'}
   = \sum_{t=1}^{T} u_t' \Sigma_u^{-1} \Bigl[ I_K - \sum_{i=1}^{t-1} \Pi_i \Bigr],
   \tag{12.3.1} \]

\[ \frac{\partial \ln l_0}{\partial \gamma'}
   = -\sum_{t=1}^{T} u_t' \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'}.
   \tag{12.3.2} \]

A recursive formula for computing the $\partial u_t / \partial \gamma'$ is given in the
following lemma.


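Lemma 12.1

Suppose $\mu = 0$ and let

\[ u_t = y_t - A_0^{-1} [A_1 y_{t-1} + \cdots + A_p y_{t-p} + M_1 u_{t-1} + \cdots + M_q u_{t-q}],
   \tag{12.3.3} \]

\[ \alpha_0 := \operatorname{vec}(A_0), \qquad
   \beta := \operatorname{vec}[A_1, \dots, A_p, M_1, \dots, M_q], \]

and suppose

\[ \begin{bmatrix} \alpha_0 \\ \beta \end{bmatrix} = R\gamma + r, \tag{12.3.4} \]

where $R$ is a known $(K^2(p+q+1) \times N)$ matrix, $r$ is a known
$K^2(p+q+1)$-dimensional vector, and $\gamma$ is an $(N \times 1)$ vector of unknown
parameters. Then, defining
$\partial u_0/\partial\gamma' = \partial u_{-1}/\partial\gamma' = \cdots = \partial u_{-q+1}/\partial\gamma' = 0$
and $y_0 = \cdots = y_{-p+1} = u_0 = \cdots = u_{-q+1} = 0$,

\[ \begin{aligned}
   \frac{\partial u_t}{\partial \gamma'}
   &= \bigl\{ (A_0^{-1} [A_1 y_{t-1} + \cdots + A_p y_{t-p} + M_1 u_{t-1} + \cdots + M_q u_{t-q}])' \otimes A_0^{-1} \bigr\}
      [I_{K^2} : 0 : \cdots : 0] R \\
   &\quad - \bigl[ (y_{t-1}', \dots, y_{t-p}', u_{t-1}', \dots, u_{t-q}') \otimes A_0^{-1} \bigr]
      [0 : I_{K^2(p+q)}] R \\
   &\quad - A_0^{-1} \Bigl[ M_1 \frac{\partial u_{t-1}}{\partial \gamma'} + \cdots + M_q \frac{\partial u_{t-q}}{\partial \gamma'} \Bigr]
   \end{aligned} \tag{12.3.5} \]

for $t = 1, \dots, T$.

Replacing $y_t$ with $y_t - \mu$ in this lemma, the expression in (12.3.5) can be
used for recursively computing the $\partial u_t / \partial \gamma'$ required in (12.3.2).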

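Proof:

\[ \begin{aligned}
   \frac{\partial u_t}{\partial \gamma'}
   &= -\bigl[ (A_1 y_{t-1} + \cdots + A_p y_{t-p} + M_1 u_{t-1} + \cdots + M_q u_{t-q})' \otimes I_K \bigr]
      \frac{\partial \operatorname{vec}(A_0^{-1})}{\partial \gamma'} \\
   &\quad - \bigl[ (y_{t-1}', \dots, y_{t-p}', u_{t-1}', \dots, u_{t-q}') \otimes A_0^{-1} \bigr]
      \frac{\partial \operatorname{vec}[A_1, \dots, A_p, M_1, \dots, M_q]}{\partial \gamma'} \\
   &\quad - A_0^{-1} [A_1, \dots, A_p, M_1, \dots, M_q]\,
      \frac{\partial}{\partial \gamma'}
      \begin{bmatrix} y_{t-1} \\ \vdots \\ y_{t-p} \\ u_{t-1} \\ \vdots \\ u_{t-q} \end{bmatrix}.
   \end{aligned} \tag{12.3.6} \]

The lemma follows by noting that

\[ \frac{\partial \operatorname{vec}(A_0^{-1})}{\partial \gamma'}
   = \frac{\partial \operatorname{vec}(A_0^{-1})}{\partial \alpha_0'} \frac{\partial \alpha_0}{\partial \gamma'}
   = -\bigl[ (A_0^{-1})' \otimes A_0^{-1} \bigr] [I_{K^2} : 0 : \cdots : 0] R
   \tag{12.3.7} \]

(see Rule (9) of Appendix A.13).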


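The partial derivatives of the approximate log-likelihood with respect to
the elements of $\Sigma_u$ are

\[ \frac{\partial \ln l_0}{\partial \Sigma_u}
   = -\frac{T}{2} \Sigma_u^{-1}
   + \frac{1}{2} \Sigma_u^{-1} \Bigl( \sum_{t=1}^{T} u_t u_t' \Bigr) \Sigma_u^{-1}
   \tag{12.3.8} \]

(see Problem 12.5). Setting this expression to zero and solving for $\Sigma_u$ gives

\[ \tilde{\Sigma}_u(\mu, \gamma) = \frac{1}{T} \sum_{t=1}^{T} u_t(\mu, \gamma) u_t(\mu, \gamma)'.
   \tag{12.3.9} \]

Substituting for $\Sigma_u$ in (12.3.1) and (12.3.2) and setting to zero results in a
generally nonlinear set of normal equations which may be solved by numerical
methods. Before we discuss a possible algorithm, it may be worth pointing
out that by substituting $\tilde{\Sigma}_u(\mu, \gamma)$ for $\Sigma_u$ in $\ln l_0$, we get

\[ \ln l_0(\mu, \gamma) = -\frac{T}{2} \ln|\tilde{\Sigma}_u(\mu, \gamma)| - \frac{TK}{2}.
   \tag{12.3.10} \]

Thus, instead of maximizing $\ln l_0$ we may equivalently minimize

\[ \ln|\tilde{\Sigma}_u(\mu, \gamma)| \quad\text{or}\quad |\tilde{\Sigma}_u(\mu, \gamma)|.
   \tag{12.3.11} \]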
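
The step from (12.3.9) to (12.3.10) can be illustrated numerically. In the sketch below the residuals are simply drawn at random (an assumption made for illustration; in practice they would come from the recursion for $u_t(\mu, \gamma)$), and the function name `log_l0_given_residuals` is hypothetical.

```python
import numpy as np

def log_l0_given_residuals(u, Sigma):
    """-T/2 ln|Sigma| - 1/2 sum_t u_t' Sigma^{-1} u_t for a (T x K) residual array."""
    T = u.shape[0]
    quad = np.trace(np.linalg.solve(Sigma, u.T @ u))
    return -0.5 * T * np.linalg.slogdet(Sigma)[1] - 0.5 * quad

# Stand-in residuals; any (T x K) array with a nonsingular moment matrix works.
rng = np.random.default_rng(1)
u = rng.standard_normal((50, 2)) @ np.array([[1.0, 0.0], [0.5, 0.8]])
T, K = u.shape

Sigma_tilde = u.T @ u / T                                   # (12.3.9)
concentrated = -0.5 * T * np.linalg.slogdet(Sigma_tilde)[1] - 0.5 * T * K   # (12.3.10)
at_sigma_tilde = log_l0_given_residuals(u, Sigma_tilde)
at_inflated = log_l0_given_residuals(u, 1.1 * Sigma_tilde)  # any other Sigma does worse
```

Evaluating the log-likelihood at $\tilde{\Sigma}_u$ should reproduce the concentrated value, while a perturbed covariance matrix yields a smaller value.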
12.3.2 Optimization Algorithms

The problem of optimizing (minimizing or maximizing) a function arises not
only in ML estimation but also in various other contexts. Therefore, general
algorithms have been developed. Following Judge et al. (1985, Section B.2),
we will give a brief introduction to so-called gradient algorithms and then
address the specific problem at hand. With the objective in mind that we
want to find the coefficient values that minimize $-\ln l_0$ or
$\ln|\tilde{\Sigma}_u(\mu, \gamma)|$, we assume that the problem is to minimize a twice
continuously differentiable, scalar valued function $h(\gamma)$, where $\gamma$ is
some $(N \times 1)$ vector.

Given a vector $\gamma_i$ in the parameter space, we are looking for a direction
(vector) $d$ in which the objective function declines. Then we can perform a
step of length $s$, say, in that direction which will take us downhill. In other
words, we seek an appropriate step direction $d$ and a step length $s$ such that

\[ h(\gamma_i + s d) < h(\gamma_i). \tag{12.3.12} \]

If $d$ is a downhill direction, a sufficiently small step in that direction will
always decrease the objective function. Thus, we are seeking a $d$ such that
$h(\gamma_i + s d)$ is a decreasing function of $s$, for $s$ sufficiently close to
zero. In other words, $d$ must be such that

\[ 0 > \left. \frac{\partial h(\gamma_i + s d)}{\partial s} \right|_{s=0}
   = \left[ \left. \frac{\partial h(\gamma)}{\partial \gamma'} \right|_{\gamma_i} \right]
     \left[ \left. \frac{\partial (\gamma_i + s d)}{\partial s} \right|_{s=0} \right]
   = \left[ \left. \frac{\partial h(\gamma)}{\partial \gamma'} \right|_{\gamma_i} \right] d. \]

Using the abbreviation

\[ h_i := \left. \frac{\partial h(\gamma)}{\partial \gamma} \right|_{\gamma_i} \]

for the gradient of $h(\gamma)$ at $\gamma_i$, a possible choice of $d$ is

\[ d = -D_i h_i, \]

where $D_i$ is any positive definite matrix. With this choice of $d$,

\[ h_i' d = -h_i' D_i h_i < 0 \]

if $h_i \neq 0$. Because the gradient is zero at a local minimum of the function,
we hope to have reached the minimum once $h_i = 0$ and, hence, $d = 0$. The
general form of an iteration of a gradient algorithm is therefore

\[ \gamma_{i+1} = \gamma_i - s_i D_i h_i, \tag{12.3.13} \]

where $s_i$ denotes the step length in the $i$-th iteration and $D_i$ is a positive
definite direction matrix. The name "gradient algorithm" stems from the fact
that the gradient $h_i$ is involved in the choice of the step direction. Many such
algorithms have been proposed in the literature (see, for example, Judge et al.
(1985, Section B.2)). They differ in their choice of the direction matrix $D_i$ and
the step length $s_i$.

To motivate the choice of the $D_i$ matrix that will be considered in the ML
algorithm presented below, we expand the objective function $h(\gamma)$ in a Taylor
series about $\gamma_i$ (see Appendix A.13, Proposition A.3),

\[ h(\gamma) \approx h(\gamma_i) + h_i' (\gamma - \gamma_i)
   + \tfrac{1}{2} (\gamma - \gamma_i)' H_i (\gamma - \gamma_i), \tag{12.3.14} \]

where

\[ H_i := \left. \frac{\partial^2 h}{\partial \gamma\, \partial \gamma'} \right|_{\gamma_i} \]

is the Hessian matrix of second order partial derivatives of $h(\gamma)$, evaluated
at $\gamma_i$. If $h(\gamma)$ were a quadratic function, the right-hand side of
(12.3.14) would be exactly equal to $h(\gamma)$ and the first order conditions for a
minimum would result by taking first order partial derivatives of the right-hand
side and setting to zero:

\[ h_i + H_i (\gamma - \gamma_i) = 0 \]

or

\[ \gamma = \gamma_i - H_i^{-1} h_i. \]

Thus, if $h(\gamma)$ were a quadratic function, starting from any vector $\gamma_i$, we
would reach the minimum in one step of length $s_i = 1$ by choosing the inverse
Hessian as the direction matrix. In general, if $h(\gamma)$ is not a quadratic
function, then the choice $D_i = H_i^{-1}$ is still reasonable once we are close to
the minimum. Recall that a positive definite Hessian is the second order
condition for a local minimum. Therefore, the inverse Hessian qualifies as a
direction matrix. A gradient algorithm with the inverse Hessian as the direction
matrix is called a Newton or Newton-Raphson algorithm.
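
The one-step property of the Newton direction on a quadratic objective can be seen in a small sketch. The quadratic function, its coefficients, and the helper name `gradient_step` are hypothetical and serve only to illustrate the iteration (12.3.13).

```python
import numpy as np

def gradient_step(gamma_i, h_i, D_i, s_i=1.0):
    """One iteration gamma_{i+1} = gamma_i - s_i D_i h_i of a gradient algorithm."""
    return gamma_i - s_i * D_i @ h_i

# Hypothetical quadratic objective h(g) = 0.5 g'Hg - b'g with positive definite H.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def h(g):
    return 0.5 * g @ H @ g - b @ g

gamma_i = np.array([10.0, -7.0])
h_i = H @ gamma_i - b                       # gradient of h at gamma_i

# Newton direction matrix D_i = H^{-1} with step length 1.
newton = gradient_step(gamma_i, h_i, np.linalg.inv(H))
# Steepest descent, D_i = I, with a short step: downhill, but not at the minimum.
steepest = gradient_step(gamma_i, h_i, np.eye(2), s_i=0.1)
minimizer = np.linalg.solve(H, b)
```

With $D_i = H^{-1}$ the minimizer is reached in one step, whereas the identity direction matrix only guarantees a decrease for a suitably small step length.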

From the previous subsection, we know that the first order partial
derivatives of our objective function $-\ln l_0$ are quite complicated and, thus,
finding the Hessian matrix of second order partial derivatives is even more
complicated. Therefore we approximate the Hessian by an estimate of the
information matrix,

\[ \mathcal{I}(\gamma) := E\left[ \frac{\partial^2 (-\ln l_0)}{\partial \gamma\, \partial \gamma'} \right],
   \tag{12.3.15} \]

which is the expected value of the Hessian matrix. The estimate of
$\mathcal{I}(\gamma)$ will be denoted by $\hat{\mathcal{I}}(\gamma)$. A computable expression
will be given in the next subsection. Because the true parameter vector $\gamma$ is
unknown, $\hat{\mathcal{I}}(\gamma_i)$ is used as an estimate of $\mathcal{I}(\gamma)$ in the
$i$-th iteration step. Hence, for given mean vector $\mu$ and white noise covariance
matrix $\Sigma_u$, we get a minimization algorithm with $i$-th iteration step

\[ \gamma_{i+1} = \gamma_i - s_i\, \hat{\mathcal{I}}(\gamma_i)^{-1}
   \left[ \left. \frac{\partial (-\ln l_0)}{\partial \gamma} \right|_{\gamma_i} \right].
   \tag{12.3.16} \]

This algorithm is called the scoring algorithm.

As it stands, we still need some more information before we can execute
this algorithm. First, we need a starting-up vector $\gamma_1$ for the first
iteration. This vector should be close to the minimizing vector to ensure that
$\hat{\mathcal{I}}(\gamma_1)$ is positive definite and we make good progress towards the
minimum even in the first iteration. We will consider one possible choice in
Section 12.3.4.

Second, we have to choose the step length $s_i$. There are various possible
alternatives (see, e.g., Judge et al. (1985, Section B.2)). Because we are just
interested in the main principles of the algorithm, we will ignore the problem
here and choose $s_i = 1$.

Third, the algorithm provides an ML estimate of $\gamma$, conditional on some
given $\Sigma_u$ matrix and mean vector $\mu$, because both the information matrix and
the gradient vector involve these quantities. They are usually also unknown.
As in the pure finite order VAR case, it can be shown that the sample mean

\[ \bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t \]

is an estimator for $\mu$ which has the same asymptotic properties as the ML
estimator. Therefore, ML estimation of $\gamma$ and $\Sigma_u$ is often done conditionally
on $\mu = \bar{y}$. In other words, the sample mean is subtracted from the data before
the VARMA coefficients are estimated.

There are different ways to handle the unknown $\Sigma_u$ matrix. From (12.3.9),
we know that

\[ \tilde{\Sigma}_u(\mu, \gamma) = \frac{1}{T} \sum_{t=1}^{T} u_t(\mu, \gamma) u_t(\mu, \gamma)'. \]

Therefore, one possibility is to use $\tilde{\Sigma}_u(\bar{y}, \gamma_i)$ in the $i$-th
iteration. Equivalently, the minimization algorithm can be applied to
$\ln|\tilde{\Sigma}_u(\bar{y}, \gamma)|$.

A number of computer program packages contain exact or approximate
ML algorithms which may be used in practice. The foregoing algorithm is
just meant to demonstrate some basic principles. Modifications in actual
applications may result in improved convergence properties. Slow convergence
or no convergence at all may be the consequence of working with VARMA
orders or Kronecker indices which are larger than the true ones and, hence,
with an overparameterized model.
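12.3.3 The Information Matrix

In the scoring algorithm described previously, an estimate of the information
matrix is needed. To see how that can be obtained, we consider the second
order partial derivatives of $-\ln l_0$,

\[ \begin{aligned}
   \frac{\partial^2 (-\ln l_0)}{\partial \gamma\, \partial \gamma'}
   &= \frac{\partial}{\partial \gamma'} \Bigl[ \sum_{t=1}^{T} \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} u_t \Bigr]
      \qquad \text{(see (12.3.2))} \\
   &= \sum_{t=1}^{T} \Bigl[ \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'}
      + (u_t' \Sigma_u^{-1} \otimes I)\,
        \frac{\partial \operatorname{vec}[\partial u_t' / \partial \gamma]}{\partial \gamma'} \Bigr].
   \end{aligned} \]

Taking the expectation of this expression, the last term vanishes because
$E(u_t) = 0$ and $u_t' \Sigma_u^{-1} \otimes I$ is independent of

\[ \frac{\partial \operatorname{vec}[\partial u_t' / \partial \gamma]}{\partial \gamma'}, \]

as this term does not contain current $y_t$ or $u_t$ variables (see Lemma 12.1).
Hence,

\[ E\left[ \frac{\partial^2 (-\ln l_0)}{\partial \gamma\, \partial \gamma'} \right]
   = \sum_{t=1}^{T} E\left[ \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'} \right]. \]

Estimating the expected value in the usual way by the sample average gives
an estimator

\[ \frac{1}{T} \sum_{t=1}^{T} \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'}
   \quad\text{for}\quad
   E\left[ \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'} \right]. \]

These considerations suggest the estimator

\[ \hat{\mathcal{I}}(\gamma) = \sum_{t=1}^{T}
   \frac{\partial u_t(\bar{y}, \gamma)'}{\partial \gamma}\, \tilde{\Sigma}_u^{-1}\,
   \frac{\partial u_t(\bar{y}, \gamma)}{\partial \gamma'} \tag{12.3.17} \]

for the information matrix $\mathcal{I}(\gamma)$. In the $i$-th iteration of the scoring
algorithm, we evaluate this estimator for $\gamma = \gamma_i$. The quantities
$\partial u_t / \partial \gamma'$ may be obtained recursively as in Lemma 12.1 to make this
estimator operational.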
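
To make the interplay of Lemma 12.1, the estimator (12.3.17), and the scoring step (12.3.16) concrete, the following sketch treats the simplest special case, a univariate zero mean MA(1) process $y_t = u_t + m u_{t-1}$ with $\gamma = m$, $A_0 = 1$, and step length $s_i = 1$. In this case the recursion of Lemma 12.1 reduces to $\partial u_t/\partial m = -u_{t-1} - m\, \partial u_{t-1}/\partial m$. The function names, the simulated data, and the fixed number of iterations are assumptions for illustration; it is not a general implementation.

```python
import numpy as np

def residuals_and_derivs(y, m):
    """u_t and du_t/dm for y_t = u_t + m u_{t-1}, with u_0 = 0 and du_0/dm = 0."""
    T = len(y)
    u = np.zeros(T)
    d = np.zeros(T)
    u_prev, d_prev = 0.0, 0.0
    for t in range(T):
        u[t] = y[t] - m * u_prev
        d[t] = -u_prev - m * d_prev        # Lemma 12.1 specialized to K = 1, q = 1
        u_prev, d_prev = u[t], d[t]
    return u, d

def scoring_ma1(y, m_start=0.0, n_iter=25):
    """Scoring iterations (12.3.16) with step length 1 for the scalar MA(1) case."""
    m = m_start
    for _ in range(n_iter):
        u, d = residuals_and_derivs(y, m)
        sigma2 = np.mean(u ** 2)           # variance estimate as in (12.3.9)
        grad = np.sum(u * d) / sigma2      # gradient of -ln l_0 with respect to m
        info = np.sum(d * d) / sigma2      # estimator (12.3.17)
        m = m - grad / info
    return m

# Simulated data with true MA coefficient 0.5 (illustrative only).
rng = np.random.default_rng(42)
eps = rng.standard_normal(2001)
y = eps[1:] + 0.5 * eps[:-1]
m_hat = scoring_ma1(y)
```

In this scalar case the variance estimate cancels from the update, so the iteration coincides with a Gauss-Newton step on the residual sum of squares.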

If $\gamma$ is the true parameter value, the asymptotic information matrix equals
plim $\hat{\mathcal{I}}(\gamma)/T$. Thus, if we have a consistent estimator
$\tilde{\gamma}$ of $\gamma$, $\hat{\mathcal{I}}(\tilde{\gamma})/T$ is a consistent estimator
of the asymptotic information matrix, that is,

\[ \mathcal{I}_a(\gamma) = \operatorname{plim} \hat{\mathcal{I}}(\tilde{\gamma})/T. \tag{12.3.18} \]

In Section 12.4, we will see that the inverse of this matrix, if it exists, is the
asymptotic covariance matrix of the ML estimator for $\gamma$. If a nonidentified
structure is used, this problem is reflected in the asymptotic information
matrix being singular. Hence, it is important at this stage to have an identified
version of a VARMA model.


12.3.4  Preliminary  Estimation 

The coefficients of a VARMA$(p, q)$ model in standard form,

\[ y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}, \]

could be estimated by multivariate LS, if the lagged $u_t$ were given. We
assume that the sample mean $\bar{y}$ has been subtracted previously. It is therefore
neglected here. In deriving preliminary estimators for the other parameters,
the idea is to fit a long pure autoregression first and then use estimated
residuals in place of the true residuals. Hence, we fit a VAR($n$) model

\[ y_t = \sum_{i=1}^{n} \Pi_i(n) y_{t-i} + u_t(n), \]

where $n$ is larger than $p$ and $q$. From that estimation, we compute estimated
residuals

\[ \hat{u}_t(n) := y_t - \sum_{i=1}^{n} \hat{\Pi}_i(n) y_{t-i}, \tag{12.3.19} \]

where the $\hat{\Pi}_i(n)$ are the multivariate LS estimators. Then we set up a
multivariate regression model

\[ Y = [A : M] X_n + U^0, \tag{12.3.20} \]

where $Y := [y_1, \dots, y_T]$, $A := [A_1, \dots, A_p]$, $M := [M_1, \dots, M_q]$,

\[ X_n := [Y_{0,n}, \dots, Y_{T-1,n}] \quad\text{with}\quad
   Y_{t,n} := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ \hat{u}_t(n) \\ \vdots \\ \hat{u}_{t-q+1}(n) \end{bmatrix}
   \quad (K(p+q) \times 1), \]

and $U^0$ is a $(K \times T)$ matrix of residuals. Usually restrictions will be
imposed on the parameters $A$ and $M$ of the model, for instance, if the model is
given in final equations form. Additional restrictions may also be available.
Suppose the restrictions are such that there exists a matrix $R$ and a vector
$\gamma$ satisfying

\[ \operatorname{vec}[A : M] = R\gamma. \tag{12.3.21} \]

Applying the vec operator to (12.3.20) and substituting $R\gamma$ for
$\operatorname{vec}[A : M]$ gives

\[ \operatorname{vec}(Y) = (X_n' \otimes I_K) R\gamma + \operatorname{vec}(U^0) \tag{12.3.22} \]

and the LS estimator of $\gamma$ is known to be

\[ \hat{\gamma}(n) = [R'(X_n X_n' \otimes I_K) R]^{-1} R'(X_n \otimes I_K) \operatorname{vec}(Y)
   \tag{12.3.23} \]

(see Chapter 5, Section 5.2). This estimator may be used as an initial vector
$\gamma_1$ in the ML algorithm described in the previous subsections.

Using this estimator, a new set of residuals may be obtained as

\[ \operatorname{vec}(\hat{U}^0) = \operatorname{vec}(Y) - (X_n' \otimes I_K) R \hat{\gamma}(n), \]

which may be used to obtain a white noise covariance estimator

\[ \tilde{\Sigma}_u(n) = \hat{U}^0 \hat{U}^{0\prime} / T. \tag{12.3.24} \]

This estimator may be used in place of $\Sigma_u$ in the initial round of the
iterative optimization algorithm described earlier.
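
The two regression stages can be sketched for the simplest unrestricted case, a VARMA(1,1) model with $R = I$ so that (12.3.23) reduces to ordinary multivariate LS. The sketch sets aside the first $n + 1$ observations as presample values; the function names, the simulated process, and the choice $n = 10$ are assumptions made for illustration only.

```python
import numpy as np

def long_var_residuals(y, n):
    """LS fit of a VAR(n) without intercept; residuals are aligned with y[n:]."""
    T = y.shape[0]
    X = np.hstack([y[n - i:T - i] for i in range(1, n + 1)])   # lags 1, ..., n
    B, *_ = np.linalg.lstsq(X, y[n:], rcond=None)
    return y[n:] - X @ B

def preliminary_varma11(y, n):
    """Regress y_t on (y_{t-1}, u_hat_{t-1}(n)); returns A1, M1 and Sigma_u estimates."""
    K = y.shape[1]
    u_hat = long_var_residuals(y, n)
    Z = np.hstack([y[n:-1], u_hat[:-1]])      # regressors for t = n + 2, ..., T
    target = y[n + 1:]
    B, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ B
    A1_hat, M1_hat = B[:K].T, B[K:].T
    Sigma_hat = resid.T @ resid / resid.shape[0]
    return A1_hat, M1_hat, Sigma_hat

# Simulated bivariate VARMA(1,1) process with hypothetical coefficients.
rng = np.random.default_rng(123)
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
M1 = np.array([[0.4, 0.0], [0.2, 0.3]])
T = 5000
u = rng.standard_normal((T, 2))
y = np.zeros((T, 2))
y[0] = u[0]
for t in range(1, T):
    y[t] = A1 @ y[t - 1] + u[t] + M1 @ u[t - 1]
y = y - y.mean(axis=0)                        # subtract the sample mean first
A1_hat, M1_hat, Sigma_hat = preliminary_varma11(y, n=10)
```

The resulting estimates are meant only as starting-up values for an iterative ML algorithm; restricted models would additionally require the matrix $R$ as in (12.3.23).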

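Alternatively, instead of the LS estimator (12.3.23), we may use an EGLS
estimator,

\[ \hat{\gamma}(n) = [R'(X_n X_n' \otimes \tilde{\Sigma}_u^{-1}) R]^{-1}
   R'(X_n \otimes \tilde{\Sigma}_u^{-1}) \operatorname{vec}(Y), \]

with $\tilde{\Sigma}_u(n)$ in place of $\tilde{\Sigma}_u$ or a white noise covariance matrix
estimator based on the residuals $\hat{u}_t(n)$.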

The echelon form of a VARMA$(p, q)$ process may be of the more general
type

\[ A_0 y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + A_0 u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q},
   \tag{12.3.25} \]

where $A_0$ is a lower triangular matrix with unit diagonal. To handle this case,
we proceed in a similar manner as in the standard case and substitute the
residuals $\hat{u}_t(n)$ for the lagged $u_t$ and for current residuals from other
equations. In other words, in the $k$-th equation we substitute estimation
residuals for $u_{it}$, $i < k$. Because $A_0$ is the coefficient matrix for both
$y_t$ and $u_t$, we define

\[ X_n^e := [Y_{0,n}^e, \dots, Y_{T-1,n}^e], \quad\text{where}\quad
   Y_{t,n}^e := \begin{bmatrix} y_{t+1} - \hat{u}_{t+1}(n) \\ Y_{t,n} \end{bmatrix}, \]

and we pick a restriction matrix $R_e$ and a vector $\gamma_e$ such that

\[ R_e \gamma_e = \operatorname{vec}[I_K - A_0, A, M]. \]

Hence,

\[ \operatorname{vec}(Y) = (X_n^{e\prime} \otimes I_K) R_e \gamma_e + \operatorname{vec}(U^0) \]

and the LS estimator of $\gamma_e$ becomes

\[ \hat{\gamma}_e(n) = [R_e'(X_n^e X_n^{e\prime} \otimes I_K) R_e]^{-1}
   R_e'(X_n^e \otimes I_K) \operatorname{vec}(Y). \]

The starting-up estimator of $\Sigma_u$ is then obtained from the residuals of this
regression. It is possible that the VARMA process corresponding to these
coefficients is unstable or noninvertible. Especially in the latter case,
modifications are desirable (see Hannan & Kavalieris (1984), Hannan & Deistler
(1988)).

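To see more clearly what is being done in this preliminary estimation
procedure, let us look at an example. Suppose the bivariate VARMA(1,1)
echelon form model from (12.1.19) with Kronecker indices $(p_1, p_2) = (1, 0)$ is
to be estimated:

\[
\begin{aligned}
y_{1,t} &= \alpha_{11,1} y_{1,t-1} + u_{1,t} + m_{11,1} u_{1,t-1} + m_{12,1} u_{2,t-1}, \\
y_{2,t} &= \alpha_{21,0} y_{1,t} - \alpha_{21,0} u_{1,t} + u_{2,t} = \alpha_{21,0}(y_{1,t} - u_{1,t}) + u_{2,t}.
\end{aligned}
\tag{12.3.26}
\]

We assume that the sample mean has been removed previously. The parameters
in the first equation are estimated by applying LS to

\[
\begin{bmatrix} y_{1,1} \\ \vdots \\ y_{1,T} \end{bmatrix}
=
\begin{bmatrix}
y_{1,0} & \hat{u}_{1,0}(n) & \hat{u}_{2,0}(n) \\
\vdots & \vdots & \vdots \\
y_{1,T-1} & \hat{u}_{1,T-1}(n) & \hat{u}_{2,T-1}(n)
\end{bmatrix}
\begin{bmatrix} \alpha_{11,1} \\ m_{11,1} \\ m_{12,1} \end{bmatrix}
+
\begin{bmatrix} u_{1,1} \\ \vdots \\ u_{1,T} \end{bmatrix}
\]

or, using obvious notation, to

\[
y_{(1)} = X_{(1)} \gamma_1 + u_{(1)}.
\]

Here the $\hat{u}_{j,t}(n)$ are the residuals from the estimated long VAR model of order
$n$. The LS estimator of $\gamma_1$ is $\hat{\gamma}_1 = (X_{(1)}' X_{(1)})^{-1} X_{(1)}' y_{(1)}$.
Similarly, $\alpha_{21,0}$ is estimated by applying LS to

\[
\begin{bmatrix} y_{2,1} \\ \vdots \\ y_{2,T} \end{bmatrix}
=
\begin{bmatrix} y_{1,1} - \hat{u}_{1,1}(n) \\ \vdots \\ y_{1,T} - \hat{u}_{1,T}(n) \end{bmatrix}
\alpha_{21,0}
+
\begin{bmatrix} u_{2,1} \\ \vdots \\ u_{2,T} \end{bmatrix}.
\]

In this case, it would be possible to use the residuals of the first regression
instead of the $\hat{u}_{1,t}(n)$, which are the residuals from the long VAR. However,
we have chosen to use the latter in the preliminary estimation procedure.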
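The two regression stages of this example can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the author's implementation: the function names `long_var_residuals` and `preliminary_first_equation` are hypothetical, the data are assumed to be mean-adjusted, and the first $n$ observations are set aside as presample values for the long VAR.

```python
import numpy as np

def long_var_residuals(y, n):
    """LS residuals of a VAR(n) without intercept.

    y has shape (T_total, K) and is assumed to be mean-adjusted.
    The first n rows serve as presample values.
    """
    T_total, K = y.shape
    # Row for time t holds [y_{t-1}', ..., y_{t-n}'], t = n, ..., T_total - 1.
    X = np.hstack([y[n - i:T_total - i] for i in range(1, n + 1)])
    Y = y[n:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ B  # shape (T_total - n, K)

def preliminary_first_equation(y, n):
    """Second-stage LS for the first equation of (12.3.26).

    Regresses y_{1,t} on y_{1,t-1} and the lagged long-VAR residuals
    and returns estimates of (alpha_{11,1}, m_{11,1}, m_{12,1}).
    """
    u_hat = long_var_residuals(y, n)
    y_eff = y[n:]  # observations aligned with the residuals
    X1 = np.column_stack([y_eff[:-1, 0], u_hat[:-1, 0], u_hat[:-1, 1]])
    gamma1, *_ = np.linalg.lstsq(X1, y_eff[1:, 0], rcond=None)
    return gamma1
```

The second equation would be handled in the same way, with the single regressor $y_{1,t} - \hat{u}_{1,t}(n)$.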

In  the  foregoing,  we  have  so  far  ignored  the  problem  of  choosing  presample 
values  for  the  estimation.  Two  alternative  choices  are  reasonable.  Either  all 
presample values are replaced by zero or some $y_t$ values at the beginning of
the  sample  are  set  aside  as  presample  values  and  the  presample  values  for  the 
residuals  are  replaced  by  zero. 

The  initial  estimators  obtained  in  the  foregoing  procedure  can  be  shown  to 
be consistent under general conditions if $n$ goes to infinity with the sample size
(see Hannan & Kavalieris (1984), Hannan & Deistler (1988), Poskitt (1992)).
We  will  discuss  the  situation,  where  VAR  processes  of  increasing  order  are 
fitted  to  a  potentially  infinite  order  process,  in  Chapter  15  and  therefore  we 
do  not  give  details  here. 


12.3.5  An  Illustration 


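We illustrate the estimation procedure using the income ($y_1$) and consumption
($y_2$) data from File E1. As in previous chapters, we use first differences of
logarithms of the data from 1960 to 1978. In this case, we subtract the sample
mean at an initial stage and denote the mean-adjusted income and consumption
variables by $y_{1t}$ and $y_{2t}$, respectively. We assume a VARMA(2,2) model
in echelon form with Kronecker indices $p = (p_1, p_2) = (0, 2)$ [ARMA$_E$(0,2)],

\[
\begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix}
=
\begin{bmatrix} 0 & 0 \\ 0 & \alpha_{22,1} \end{bmatrix}
\begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ 0 & \alpha_{22,2} \end{bmatrix}
\begin{bmatrix} y_{1,t-2} \\ y_{2,t-2} \end{bmatrix}
+
\begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ m_{21,1} & m_{22,1} \end{bmatrix}
\begin{bmatrix} u_{1,t-1} \\ u_{2,t-1} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ m_{21,2} & m_{22,2} \end{bmatrix}
\begin{bmatrix} u_{1,t-2} \\ u_{2,t-2} \end{bmatrix}.
\tag{12.3.27}
\]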


In the next chapter, it will become apparent why this model is chosen. It
implies that the first variable (income) is white noise ($y_{1t} = u_{1t}$). Given the
subset VAR models of Chapter 5 (Table 5.1), this specification does not appear to
be totally unreasonable. The second equation in (12.3.27) describes consumption
as a function of lagged consumption, lagged income ($u_{1,t-i} = y_{1,t-i}$), and
a moving average term involving lagged residuals $u_{2,t}$.

Eventually we use a sample from 1960.2 ($t = 1$) to 1978.4 ($t = 75$), that is,
$T = 75$. In the preliminary estimation of the model (12.3.27), we estimate a
VAR(8) model first, using 8 presample values. Then, using two more presample
values, we run a regression of $y_{2t}$ on its own lags and lagged $\hat{u}_t(8)$. More
precisely, the regression model is

\[
\begin{bmatrix} y_{2,11} \\ \vdots \\ y_{2,T} \end{bmatrix}
=
\begin{bmatrix}
y_{2,10} & y_{2,9} & \hat{u}_{1,10}(8) & \hat{u}_{2,10}(8) & \hat{u}_{1,9}(8) & \hat{u}_{2,9}(8) \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
y_{2,T-1} & y_{2,T-2} & \hat{u}_{1,T-1}(8) & \hat{u}_{2,T-1}(8) & \hat{u}_{1,T-2}(8) & \hat{u}_{2,T-2}(8)
\end{bmatrix}
\gamma
+
\begin{bmatrix} u_{2,11} \\ \vdots \\ u_{2,T} \end{bmatrix},
\]

where $\gamma := (\alpha_{22,1}, \alpha_{22,2}, m_{21,1}, m_{22,1}, m_{21,2}, m_{22,2})'$. In this particular case,
we could have substituted $y_{1t}$ for $\hat{u}_{1t}(8)$ because the model implies $y_{1t} = u_{1t}$.
We have not done so, however, but we have used the residuals from the long
autoregression. The resulting preliminary parameter estimates

\[
\tilde{\gamma}_1 = (\tilde{\alpha}_{22,1}(1), \dots, \tilde{m}_{22,2}(1))'
\]

are given in Table 12.1.


Table 12.1. Iterative estimates of the income/consumption system

The six middle columns are the components of $\tilde{\gamma}_i$; the last column is
$|\tilde{\Sigma}_u(\tilde{\gamma}_i)| \times 10^8$.

  i    alpha22,1   alpha22,2   m21,1    m22,1    m21,2    m22,2    |Sigma_u| x 10^8
  1      0.020       0.395     0.296   -0.367    0.181   -0.224       0.872564
  2     -0.178       0.492     0.331   -0.527    0.175   -0.015       0.942791
  3      0.072       0.117     0.305   -0.589    0.191    0.065       0.779788
  4      0.202       0.078     0.311   -0.731    0.146    0.147       0.776107
  5      0.219       0.063     0.312   -0.744    0.142    0.158       0.775959
  6      0.224       0.062     0.313   -0.748    0.140    0.159       0.775952
 10      0.225       0.061     0.313   -0.750    0.140    0.160       0.775951

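We use these estimates to start the scoring algorithm. For our particular
example, the $i$-th iteration proceeds as follows:

(1) Compute residuals
\[
\tilde{u}_t(i) = y_t - \tilde{A}_1(i) y_{t-1} - \tilde{A}_2(i) y_{t-2} - \tilde{M}_1(i) \tilde{u}_{t-1}(i) - \tilde{M}_2(i) \tilde{u}_{t-2}(i)
\]
recursively, for $t = 1, 2, \dots, T$, with $\tilde{u}_{-1}(i) = \tilde{u}_0(i) = y_{-1} = y_0 = 0$ and
\[
\tilde{A}_1(i) = \begin{bmatrix} 0 & 0 \\ 0 & \tilde{\alpha}_{22,1}(i) \end{bmatrix}, \quad
\tilde{A}_2(i) = \begin{bmatrix} 0 & 0 \\ 0 & \tilde{\alpha}_{22,2}(i) \end{bmatrix},
\]
\[
\tilde{M}_1(i) = \begin{bmatrix} 0 & 0 \\ \tilde{m}_{21,1}(i) & \tilde{m}_{22,1}(i) \end{bmatrix}, \quad
\tilde{M}_2(i) = \begin{bmatrix} 0 & 0 \\ \tilde{m}_{21,2}(i) & \tilde{m}_{22,2}(i) \end{bmatrix}.
\]

(2) Compute the partial derivatives $\partial \tilde{u}_t / \partial \gamma'$ recursively as
\[
\frac{\partial \tilde{u}_t}{\partial \gamma'}(i) =
- \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
y_{2,t-1} & y_{2,t-2} & \tilde{u}_{1,t-1}(i) & \tilde{u}_{2,t-1}(i) & \tilde{u}_{1,t-2}(i) & \tilde{u}_{2,t-2}(i)
\end{bmatrix}
- \tilde{M}_1(i) \frac{\partial \tilde{u}_{t-1}}{\partial \gamma'}(i)
- \tilde{M}_2(i) \frac{\partial \tilde{u}_{t-2}}{\partial \gamma'}(i)
\]
for $t = 1, 2, \dots, T$, with
\[
\frac{\partial \tilde{u}_{-1}}{\partial \gamma'}(i) = \frac{\partial \tilde{u}_0}{\partial \gamma'}(i) = 0.
\]

(3) Compute
\[
\tilde{\Sigma}_u(\tilde{\gamma}_i) = \frac{1}{T} \sum_{t=1}^{T} \tilde{u}_t(i) \tilde{u}_t(i)',
\]
\[
\mathcal{I}(\tilde{\gamma}_i) = \sum_{t=1}^{T} \frac{\partial \tilde{u}_t'}{\partial \gamma}(i)\, \tilde{\Sigma}_u(\tilde{\gamma}_i)^{-1} \frac{\partial \tilde{u}_t}{\partial \gamma'}(i),
\]
and
\[
\frac{\partial(-\ln l_0)}{\partial \gamma}\bigg|_{\tilde{\gamma}_i} = \sum_{t=1}^{T} \frac{\partial \tilde{u}_t'}{\partial \gamma}(i)\, \tilde{\Sigma}_u(\tilde{\gamma}_i)^{-1} \tilde{u}_t(i).
\]

(4) Perform the iteration step
\[
\tilde{\gamma}_{i+1} = \tilde{\gamma}_i - \mathcal{I}(\tilde{\gamma}_i)^{-1} \left[ \frac{\partial(-\ln l_0)}{\partial \gamma}\bigg|_{\tilde{\gamma}_i} \right].
\]

Some estimates obtained in these iterations are also given in Table 12.1
together with $|\tilde{\Sigma}_u(\tilde{\gamma}_i)|$. After a few iterations the latter quantity approximately
reaches its minimum and, thus, $-\ln l_0$ attains its minimum. After the tenth
iteration there is not much change in the $\tilde{\gamma}_i$ and $|\tilde{\Sigma}_u(\tilde{\gamma}_i)|$ in further steps. We
work with $\tilde{\gamma}_{10}$ in the following.

The determinantal polynomial of the MA operator for $i = 10$ is

\[
|I_2 + \tilde{M}_1(10) z + \tilde{M}_2(10) z^2| = 1 + \tilde{m}_{22,1}(10) z + \tilde{m}_{22,2}(10) z^2 = 1 - .750 z + .160 z^2,
\]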


which has roots that are clearly outside the unit circle. Thus, the estimated
MA operator is invertible. Also, the determinant of the estimated VAR polynomial,

\[
|I_2 - \tilde{A}_1(10) z - \tilde{A}_2(10) z^2| = 1 - \tilde{\alpha}_{22,1}(10) z - \tilde{\alpha}_{22,2}(10) z^2 = 1 - .225 z - .061 z^2,
\]

is easily seen to have its roots outside the unit circle. Hence, the estimated
VARMA process is stable and invertible.
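The two root conditions can be checked numerically. The following sketch uses the rounded coefficients of the tenth iteration in Table 12.1; the helper name `roots_of` is hypothetical and only serves this illustration.

```python
import numpy as np

# Coefficients in increasing powers of z (rounded values from Table 12.1, i = 10).
ma_poly = [1.0, -0.750, 0.160]   # 1 + m22,1 z + m22,2 z^2
ar_poly = [1.0, -0.225, -0.061]  # 1 - alpha22,1 z - alpha22,2 z^2

def roots_of(poly_ascending):
    # np.roots expects the coefficient of the highest power first.
    return np.roots(poly_ascending[::-1])

ma_roots = roots_of(ma_poly)
ar_roots = roots_of(ar_poly)

# Invertibility and stability require all roots to lie outside the unit circle.
invertible = bool(np.all(np.abs(ma_roots) > 1))
stable = bool(np.all(np.abs(ar_roots) > 1))
print(np.abs(ma_roots), np.abs(ar_roots), invertible, stable)
```

The MA polynomial has a pair of complex conjugate roots with modulus 2.5, and the VAR polynomial has two real roots of roughly 2.6 and -6.3, so both conditions hold with a comfortable margin.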

Generally,  computing  the  ML  estimates  is  not  always  easy.  Therefore,  other 
estimation  methods  were  also  proposed  in  the  literature  (e.g.,  Koreisha  & 
Pukkila  (1987),  van  Overschee  &  DeMoor  (1994),  Kapetanios  (2003)). 
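For readers who want to experiment with the scoring iterations of this section, the four steps can be sketched as follows for the bivariate model (12.3.27). This is a minimal illustration, not the code used to produce Table 12.1: the function names are hypothetical, `y` is assumed to be a mean-adjusted (T x 2) array, all presample values are set to zero, and the step length is fixed at one.

```python
import numpy as np

def unpack(gamma):
    """Map gamma = (a22_1, a22_2, m21_1, m22_1, m21_2, m22_2) to A1, A2, M1, M2."""
    a1, a2, m211, m221, m212, m222 = gamma
    A1 = np.array([[0.0, 0.0], [0.0, a1]])
    A2 = np.array([[0.0, 0.0], [0.0, a2]])
    M1 = np.array([[0.0, 0.0], [m211, m221]])
    M2 = np.array([[0.0, 0.0], [m212, m222]])
    return A1, A2, M1, M2

def residuals_and_derivatives(y, gamma):
    """Steps (1) and (2): residuals u_t and derivatives du_t/dgamma'."""
    T = y.shape[0]
    A1, A2, M1, M2 = unpack(gamma)
    yp = np.vstack([np.zeros((2, 2)), y])  # two zero presample observations
    u = np.zeros((T + 2, 2))
    du = np.zeros((T + 2, 2, 6))
    for t in range(2, T + 2):
        u[t] = yp[t] - A1 @ yp[t - 1] - A2 @ yp[t - 2] - M1 @ u[t - 1] - M2 @ u[t - 2]
        Z = np.zeros((2, 6))
        Z[1] = [yp[t - 1, 1], yp[t - 2, 1], u[t - 1, 0], u[t - 1, 1], u[t - 2, 0], u[t - 2, 1]]
        du[t] = -Z - M1 @ du[t - 1] - M2 @ du[t - 2]
    return u[2:], du[2:]

def scoring_step(y, gamma):
    """Steps (3) and (4): one scoring update with unit step length."""
    u, du = residuals_and_derivatives(y, gamma)
    sigma_u = u.T @ u / y.shape[0]
    sigma_inv = np.linalg.inv(sigma_u)
    info = np.einsum("tia,ij,tjb->ab", du, sigma_inv, du)  # information matrix estimate
    grad = np.einsum("tia,ij,tj->a", du, sigma_inv, u)     # gradient of -ln l_0
    return gamma - np.linalg.solve(info, grad), sigma_u, info
```

Calling `scoring_step` repeatedly, starting from the preliminary estimate, mimics the rows of Table 12.1; in practice one would monitor the determinant of `sigma_u` and possibly shorten the step if it increases.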


12.4  Asymptotic  Properties  of  the  ML  Estimators 

12.4.1  Theoretical  Results 

In  this  section,  the  asymptotic  properties  of  the  ML  estimators  are  given. 
We  will  not  prove  the  main  result  but  refer  the  reader  to  Hannan  (1979), 
Dunsmuir  &  Hannan  (1976),  Hannan  &  Deistler  (1988),  and  Kohn  (1979)  for 
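further discussions and proofs.

Proposition 12.1 (Asymptotic Properties of ML Estimators)

Let $y_t$ be a $K$-dimensional, stationary Gaussian process with stable and
invertible VARMA($p, q$) representation

\[
A_0(y_t - \mu) = A_1(y_{t-1} - \mu) + \cdots + A_p(y_{t-p} - \mu) + A_0 u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q},
\tag{12.4.1}
\]

where $u_t$ is Gaussian white noise with nonsingular covariance matrix $\Sigma_u$.
Suppose the VAR and MA operators are left-coprime and either in final equations
form or in echelon form with possibly linear restrictions on the coefficients so
that the coefficient matrices $A_0, A_1, \dots, A_p, M_1, \dots, M_q$ depend on a set of
unrestricted parameters $\gamma$ as in (12.2.21). Let $\tilde{\mu}$, $\tilde{\gamma}$, and $\tilde{\Sigma}_u$ be the ML
estimators of $\mu$, $\gamma$, and $\Sigma_u$, respectively, and denote $\mathrm{vech}(\tilde{\Sigma}_u)$ and $\mathrm{vech}(\Sigma_u)$
by $\tilde{\sigma}$ and $\sigma$, respectively. Then all three ML estimators are consistent and
asymptotically normally distributed,

\[
\sqrt{T}
\begin{bmatrix} \tilde{\mu} - \mu \\ \tilde{\gamma} - \gamma \\ \tilde{\sigma} - \sigma \end{bmatrix}
\xrightarrow{d}
N\left( 0,
\begin{bmatrix}
\Sigma_{\tilde{\mu}} & 0 & 0 \\
0 & \Sigma_{\tilde{\gamma}} & 0 \\
0 & 0 & \Sigma_{\tilde{\sigma}}
\end{bmatrix}
\right),
\tag{12.4.2}
\]

where

\[
\Sigma_{\tilde{\mu}} = A(1)^{-1} M(1) \Sigma_u M(1)' A(1)'^{-1},
\]

\[
\Sigma_{\tilde{\gamma}} = \mathrm{plim} \left[ \frac{1}{T} \sum_{t=1}^{T} \frac{\partial u_t'}{\partial \gamma} \Sigma_u^{-1} \frac{\partial u_t}{\partial \gamma'} \right]^{-1},
\]

with $\partial u_t / \partial \gamma'$ as given in Lemma 12.1, and

\[
\Sigma_{\tilde{\sigma}} = 2 D_K^+ (\Sigma_u \otimes \Sigma_u) D_K^{+\prime}
\]

with $D_K^+ = (D_K' D_K)^{-1} D_K'$, where $D_K$ is the $(K^2 \times \tfrac{1}{2} K(K+1))$ duplication
matrix. The covariance matrix in (12.4.2) is consistently estimated by replacing
the unknown quantities by their ML estimators. ■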
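The covariance matrix of the vech of the white noise covariance estimator is easy to evaluate once the duplication matrix is available. The sketch below builds the duplication matrix explicitly and evaluates the formula of the proposition; the function names are hypothetical, and column-wise stacking is assumed for both vec and vech.

```python
import numpy as np

def duplication_matrix(K):
    """Return D_K such that D_K @ vech(A) == vec(A) for symmetric (K x K) A."""
    D = np.zeros((K * K, K * (K + 1) // 2))
    col = 0
    for j in range(K):
        for i in range(j, K):
            D[j * K + i, col] = 1.0  # position of a_ij in the column-stacked vec(A)
            D[i * K + j, col] = 1.0  # position of the symmetric partner a_ji
            col += 1
    return D

def vech(A):
    """Stack the elements on and below the main diagonal column by column."""
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def sigma_tilde_sigma(Sigma_u):
    """Evaluate 2 D_K^+ (Sigma_u kron Sigma_u) D_K^+' for a given Sigma_u."""
    D = duplication_matrix(Sigma_u.shape[0])
    D_plus = np.linalg.solve(D.T @ D, D.T)
    return 2.0 * D_plus @ np.kron(Sigma_u, Sigma_u) @ D_plus.T
```

For $K = 1$ the expression reduces to twice the squared variance, the familiar asymptotic variance of a Gaussian variance estimator.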

Some  remarks  on  this  proposition  may  be  worthwhile. 

Remark 1 The results of the proposition do not change if the ML estimator
$\tilde{\mu}$ is replaced by the sample mean $\bar{y}$ and $\tilde{\gamma}$ and $\tilde{\sigma}$ are ML estimators conditional
on $\bar{y}$, that is, $\tilde{\gamma}$ and $\tilde{\sigma}$ are obtained by replacing $\mu$ by $\bar{y}$ in the ML algorithm.
One consequence of this result is that asymptotically the sample mean is a
fully efficient estimator of $\mu$. ■

Remark 2 The proposition is formulated for final equations or echelon form
VARMA models. Its statement remains true for other uniquely identified
structures. ■

Remark 3 Because the covariance matrix of the asymptotic distribution in
(12.4.2) is block-diagonal, the estimators of $\mu$, $\gamma$, and $\Sigma_u$ are asymptotically
independent. ■

Remark 4 Much of the proposition remains valid even if $y_t$ is not normally
distributed. In that case the estimators obtained by maximizing the Gaussian
likelihood function are quasi ML estimators. If $u_t$ is independent standard
white noise (see Chapter 3, Definition 3.1), $\tilde{\gamma}$ and $\tilde{\mu}$ maintain their asymptotic
properties. The covariance matrix of $\tilde{\sigma}$ may be different from the one given
in Proposition 12.1. ■

Remark 5 The results of the proposition remain valid under general conditions
if instead of the ML estimator $\tilde{\gamma}$ an estimator is used which is obtained
from one iteration of the scoring algorithm outlined in Section 12.3.2, starting
from the preliminary estimator of Section 12.3.4. Thus, one possible approach
to estimating the parameters of a VARMA model is to compute the sample
mean $\bar{y}$ first and use that as an estimator of $\mu$. Then the preliminary estimator
for $\gamma$ may be computed as described in Section 12.3.4 and that estimator
is used as the initial vector in the optimization algorithm of Section 12.3.2.
Then just one step of the form (12.3.16) is performed with $s_i = s_1 = 1$. The
resulting estimators $\tilde{\gamma}_2$ and $\tilde{\Sigma}_u(\bar{y}, \tilde{\gamma}_2)$ may then be used instead of $\tilde{\gamma}$ and $\tilde{\Sigma}_u$
in Proposition 12.1. Under general conditions, they have the same asymptotic
distributions as the actual ML estimators. Of course, this possibility is a
computationally attractive way to estimate the coefficients of a VARMA model.
In general, the small sample properties of the resulting estimators are not the
same as those of the ML estimators, however. ■

Remark 6 Because often the final equations form involves more parameters
than the echelon form, unrestricted estimation of the former may result in
inefficient estimators. Intuitively, if we start from the echelon form and
determine the corresponding final equations form, the coefficients of the latter
are seen to satisfy restrictions that could be imposed to obtain more efficient
estimators. ■

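In the following sections, we will occasionally be interested in the asymptotic
distribution of the coefficients of the standard representation of the process,

\[
(y_t - \mu) = A_1(y_{t-1} - \mu) + \cdots + A_p(y_{t-p} - \mu) + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q}.
\tag{12.4.3}
\]

The coefficients are functions of $\gamma$ and their asymptotic distributions follow
in the usual way. Let

\[
\alpha := \mathrm{vec}[A_1, \dots, A_p] \quad \text{and} \quad m := \mathrm{vec}[M_1, \dots, M_q],
\]

then

\[
\begin{bmatrix} \alpha \\ m \end{bmatrix} = \begin{bmatrix} \alpha(\gamma) \\ m(\gamma) \end{bmatrix}.
\]

The ML estimators are

\[
\begin{bmatrix} \tilde{\alpha} \\ \tilde{m} \end{bmatrix} = \begin{bmatrix} \alpha(\tilde{\gamma}) \\ m(\tilde{\gamma}) \end{bmatrix}.
\]

They are consistent and asymptotically normal,

\[
\sqrt{T} \left( \begin{bmatrix} \tilde{\alpha} \\ \tilde{m} \end{bmatrix} - \begin{bmatrix} \alpha \\ m \end{bmatrix} \right)
\xrightarrow{d}
N\left( 0,\ \Sigma_{\tilde{\alpha},\tilde{m}} =
\begin{bmatrix} \dfrac{\partial \alpha}{\partial \gamma'} \\[2mm] \dfrac{\partial m}{\partial \gamma'} \end{bmatrix}
\Sigma_{\tilde{\gamma}}
\begin{bmatrix} \dfrac{\partial \alpha'}{\partial \gamma} & \dfrac{\partial m'}{\partial \gamma} \end{bmatrix}
\right).
\tag{12.4.4}
\]

If $A_0 = I_K$, $\alpha$ and $m$ will often be linearly related to $\gamma$ and we get the following
corollary of Proposition 12.1.

Corollary 12.1.1

Under the conditions of Proposition 12.1, if

\[
\begin{bmatrix} \alpha \\ m \end{bmatrix} = R\gamma + r,
\]

then

\[
\sqrt{T} \left( \begin{bmatrix} \tilde{\alpha} \\ \tilde{m} \end{bmatrix} - \begin{bmatrix} \alpha \\ m \end{bmatrix} \right)
\xrightarrow{d} N(0, R \Sigma_{\tilde{\gamma}} R')
\]

and $\tilde{\alpha}$ and $\tilde{m}$ are asymptotically independent of $\tilde{\mu}$ and $\tilde{\sigma}$. ■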

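The remarks following the proposition also apply for the corollary. For
illustrative purposes, consider the bivariate ARMA$_E$(0,1) model,

\[
\begin{bmatrix} 1 & 0 \\ 0 & 1 - \alpha_{22,1} L \end{bmatrix} y_t
=
\begin{bmatrix} 1 & 0 \\ m_{21,1} L & 1 + m_{22,1} L \end{bmatrix} u_t
\]

or

\[
y_t = \begin{bmatrix} 0 & 0 \\ 0 & \alpha_{22,1} \end{bmatrix} y_{t-1} + u_t
+ \begin{bmatrix} 0 & 0 \\ m_{21,1} & m_{22,1} \end{bmatrix} u_{t-1}.
\]

In this case,

\[
\alpha = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \alpha_{22,1} \end{bmatrix}, \quad
m = \begin{bmatrix} 0 \\ m_{21,1} \\ 0 \\ m_{22,1} \end{bmatrix}, \quad
\gamma = \begin{bmatrix} \alpha_{22,1} \\ m_{21,1} \\ m_{22,1} \end{bmatrix}, \quad
R = \begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0 \\
1 & 0 & 0 \\
0 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0 \\
0 & 0 & 1
\end{bmatrix}
\quad \text{and} \quad r = 0.
\tag{12.4.5}
\]

If the VARMA model is not in standard form originally, we premultiply
by $A_0^{-1}$ to get

\[
y_t - \mu = A_0^{-1} A_1 (y_{t-1} - \mu) + \cdots + A_0^{-1} A_p (y_{t-p} - \mu)
+ u_t + A_0^{-1} M_1 u_{t-1} + \cdots + A_0^{-1} M_q u_{t-q}.
\tag{12.4.6}
\]

In this case, it is more reasonable to assume that

\[
\beta_0 := \mathrm{vec}[A_0, A_1, \dots, A_p, M_1, \dots, M_q]
\tag{12.4.7}
\]

is linearly related to $\gamma$, say,

\[
\beta_0 = R\gamma + r.
\tag{12.4.8}
\]

Then it follows for

\[
\alpha := \mathrm{vec}[A_0^{-1} A_1, \dots, A_0^{-1} A_p] = \mathrm{vec}(A_0^{-1}[A_1, \dots, A_p])
\tag{12.4.9}
\]

and

\[
m := \mathrm{vec}(A_0^{-1}[M_1, \dots, M_q])
\tag{12.4.10}
\]

that

\[
\begin{bmatrix} \dfrac{\partial \alpha}{\partial \gamma'} \\[2mm] \dfrac{\partial m}{\partial \gamma'} \end{bmatrix}
=
\begin{bmatrix} \dfrac{\partial \alpha}{\partial \beta_0'} \\[2mm] \dfrac{\partial m}{\partial \beta_0'} \end{bmatrix}
\frac{\partial \beta_0}{\partial \gamma'}
=
\begin{bmatrix} \dfrac{\partial \alpha}{\partial \beta_0'} \\[2mm] \dfrac{\partial m}{\partial \beta_0'} \end{bmatrix} R.
\]

Hence, we need to evaluate $\partial \alpha / \partial \beta_0'$ and $\partial m / \partial \beta_0'$ to obtain the asymptotic
covariance matrix of the standard form coefficients.

\[
\begin{aligned}
\frac{\partial \alpha}{\partial \beta_0'}
&= (I_{Kp} \otimes A_0^{-1}) \frac{\partial\, \mathrm{vec}[A_1, \dots, A_p]}{\partial \beta_0'}
+ \left( \begin{bmatrix} A_1' \\ \vdots \\ A_p' \end{bmatrix} \otimes I_K \right)
\frac{\partial\, \mathrm{vec}(A_0^{-1})}{\partial \beta_0'} \\
&= (I_{Kp} \otimes A_0^{-1}) [0 : I_{K^2 p} : 0]
- \left( \begin{bmatrix} A_1' \\ \vdots \\ A_p' \end{bmatrix} \otimes I_K \right)
\left( (A_0^{-1})' \otimes A_0^{-1} \right)
\frac{\partial\, \mathrm{vec}(A_0)}{\partial \beta_0'}
\quad \text{(see Rule 9 of Appendix A.13)} \\
&= (I_{Kp} \otimes A_0^{-1}) [0 : I_{K^2 p} : 0]
- \left( \begin{bmatrix} (A_0^{-1} A_1)' \\ \vdots \\ (A_0^{-1} A_p)' \end{bmatrix} \otimes A_0^{-1} \right)
[I_{K^2} : 0].
\end{aligned}
\tag{12.4.11}
\]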

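A similar expression is obtained for $\partial m / \partial \beta_0'$. This result is summarized in
the next corollary.

Corollary 12.1.2

Under the conditions of Proposition 12.1, if $\beta_0$ is as defined in (12.4.7) and
satisfies the restrictions in (12.4.8) and $\alpha$ and $m$ are the coefficients of the
standard form VARMA representation defined in (12.4.9) and (12.4.10),
respectively, with ML estimators $\tilde{\alpha}$ and $\tilde{m}$, then

\[
\sqrt{T} \left( \begin{bmatrix} \tilde{\alpha} \\ \tilde{m} \end{bmatrix} - \begin{bmatrix} \alpha \\ m \end{bmatrix} \right)
\xrightarrow{d}
N\left( 0,\ \Sigma_{\tilde{\alpha},\tilde{m}} =
\begin{bmatrix} H_\alpha \\ H_m \end{bmatrix} R \Sigma_{\tilde{\gamma}} R' [H_\alpha' : H_m'] \right),
\]

where

\[
\underset{(K^2 p \times K^2(p+q+1))}{H_\alpha} := \frac{\partial \alpha}{\partial \beta_0'}
= (I_{Kp} \otimes A_0^{-1}) \big[ \underset{(K^2 p \times K^2)}{0} : I_{K^2 p} : \underset{(K^2 p \times K^2 q)}{0} \big]
- \left( \begin{bmatrix} (A_0^{-1} A_1)' \\ \vdots \\ (A_0^{-1} A_p)' \end{bmatrix} \otimes A_0^{-1} \right) [I_{K^2} : 0]
\]

and

\[
\underset{(K^2 q \times K^2(p+q+1))}{H_m} := \frac{\partial m}{\partial \beta_0'}
= (I_{Kq} \otimes A_0^{-1}) [0 : I_{K^2 q}]
- \left( \begin{bmatrix} (A_0^{-1} M_1)' \\ \vdots \\ (A_0^{-1} M_q)' \end{bmatrix} \otimes A_0^{-1} \right) [I_{K^2} : 0].
\]
■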


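Again an example may be worthwhile. Consider the following bivariate
ARMA$_E$(2,1) process with some zero restrictions placed on the coefficients
(see also Problem 12.3):

\[
\begin{bmatrix}
1 - \alpha_{11,1} L - \alpha_{11,2} L^2 & 0 \\
-\alpha_{21,0} - \alpha_{21,1} L & 1 - \alpha_{22,1} L
\end{bmatrix} y_t
=
\begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 + m_{22,1} L \end{bmatrix} u_t
\tag{12.4.12}
\]

or

\[
\begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} y_t
=
\begin{bmatrix} \alpha_{11,1} & 0 \\ \alpha_{21,1} & \alpha_{22,1} \end{bmatrix} y_{t-1}
+ \begin{bmatrix} \alpha_{11,2} & 0 \\ 0 & 0 \end{bmatrix} y_{t-2}
+ \begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix} u_t
+ \begin{bmatrix} 0 & 0 \\ 0 & m_{22,1} \end{bmatrix} u_{t-1}.
\]

Hence,

\[
\beta_0 = (1, -\alpha_{21,0}, 0, 1, \alpha_{11,1}, \alpha_{21,1}, 0, \alpha_{22,1}, \alpha_{11,2}, 0, 0, 0, 0, 0, 0, m_{22,1})' = R\gamma + r
\]

with $r = (1, 0, 0, 1, 0, \dots, 0)'$,

\[
R = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},
\quad
\gamma = \begin{bmatrix} \alpha_{21,0} \\ \alpha_{11,1} \\ \alpha_{21,1} \\ \alpha_{22,1} \\ \alpha_{11,2} \\ m_{22,1} \end{bmatrix}.
\]

Furthermore,

\[
A_0^{-1} = \begin{bmatrix} 1 & 0 \\ -\alpha_{21,0} & 1 \end{bmatrix}^{-1}
= \begin{bmatrix} 1 & 0 \\ \alpha_{21,0} & 1 \end{bmatrix},
\]

\[
\alpha = \mathrm{vec}[A_0^{-1} A_1, A_0^{-1} A_2]
= \mathrm{vec}\begin{bmatrix}
\alpha_{11,1} & 0 & \alpha_{11,2} & 0 \\
\alpha_{11,1}\alpha_{21,0} + \alpha_{21,1} & \alpha_{22,1} & \alpha_{11,2}\alpha_{21,0} & 0
\end{bmatrix}
= \begin{bmatrix}
\alpha_{11,1} \\ \alpha_{11,1}\alpha_{21,0} + \alpha_{21,1} \\ 0 \\ \alpha_{22,1} \\ \alpha_{11,2} \\ \alpha_{11,2}\alpha_{21,0} \\ 0 \\ 0
\end{bmatrix}
\]

and

\[
m = \mathrm{vec}[A_0^{-1} M_1] = \mathrm{vec}\begin{bmatrix} 0 & 0 \\ 0 & m_{22,1} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ m_{22,1} \end{bmatrix}.
\]

Consequently,

\[
\frac{\partial \alpha}{\partial \gamma'} = H_\alpha R =
\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
\alpha_{11,1} & \alpha_{21,0} & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
\alpha_{11,2} & 0 & 0 & 0 & \alpha_{21,0} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\tag{12.4.13}
\]

and

\[
\frac{\partial m}{\partial \gamma'} = H_m R =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\tag{12.4.14}
\]

(see also Problem 12.7).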
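The Jacobian in (12.4.13) can be double-checked numerically by differentiating the map from the echelon form parameters to the standard form coefficients. The sketch below does this with central finite differences; the function names and the parameter values used for the check are hypothetical and chosen only for illustration.

```python
import numpy as np

def standard_form(gamma):
    """Map gamma = (a21_0, a11_1, a21_1, a22_1, a11_2, m22_1) to (alpha, m)."""
    a210, a111, a211, a221, a112, m221 = gamma
    A0 = np.array([[1.0, 0.0], [-a210, 1.0]])
    A1 = np.array([[a111, 0.0], [a211, a221]])
    A2 = np.array([[a112, 0.0], [0.0, 0.0]])
    M1 = np.array([[0.0, 0.0], [0.0, m221]])
    A0_inv = np.linalg.inv(A0)
    # vec stacks columns, hence order="F".
    alpha = np.hstack([A0_inv @ A1, A0_inv @ A2]).flatten(order="F")
    m = (A0_inv @ M1).flatten(order="F")
    return alpha, m

def analytic_jacobian_alpha(gamma):
    """The matrix given in (12.4.13)."""
    a210, a111, a211, a221, a112, m221 = gamma
    J = np.zeros((8, 6))
    J[0, 1] = 1.0
    J[1, :3] = [a111, a210, 1.0]
    J[3, 3] = 1.0
    J[4, 4] = 1.0
    J[5, 0], J[5, 4] = a112, a210
    return J

def numerical_jacobian(f, gamma, eps=1e-6):
    """Central finite-difference approximation of df/dgamma'."""
    gamma = np.asarray(gamma, dtype=float)
    cols = []
    for k in range(gamma.size):
        step = np.zeros_like(gamma)
        step[k] = eps
        cols.append((f(gamma + step) - f(gamma - step)) / (2 * eps))
    return np.column_stack(cols)

gamma0 = [0.4, 0.5, -0.3, 0.2, 0.1, 0.6]  # arbitrary illustrative values
J_num = numerical_jacobian(lambda g: standard_form(g)[0], gamma0)
print(np.allclose(J_num, analytic_jacobian_alpha(gamma0), atol=1e-6))
```

The same helper applied to the second component of `standard_form` reproduces (12.4.14).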


12.4.2  A  Real  Data  Example 


In their general form, the results may look more complicated than they usually
are. Therefore, considering our income/consumption example from Section
12.3.5 again may be helpful. For the VARMA(2,2) model with Kronecker
indices (0,2) given in (12.3.27), the parameters are

\[
\gamma = (\alpha_{22,1}, \alpha_{22,2}, m_{21,1}, m_{22,1}, m_{21,2}, m_{22,2})'.
\]

The ML estimates are given in Table 12.1. Using $\tilde{\gamma} = \tilde{\gamma}_{10}$ from that table, an
estimate of $\mathcal{I}(\tilde{\gamma})$ is obtained from the iterations described in Section 12.3.5,
that is, we use $\mathcal{I}(\tilde{\gamma}_{10})$ for $\mathcal{I}(\tilde{\gamma})$. The square roots of the diagonal elements of
$\mathcal{I}(\tilde{\gamma})^{-1}$ are estimates of the standard errors of the elements of $\tilde{\gamma}$. Giving the
estimated standard errors in parentheses, we get

\[
\tilde{\gamma} = \begin{bmatrix}
.225 \ (.252) \\
.061 \ (.166) \\
.313 \ (.090) \\
-.750 \ (.274) \\
.140 \ (.141) \\
.160 \ (.233)
\end{bmatrix}.
\tag{12.4.15}
\]

As mentioned in Remark 5 of Section 12.4.1, an alternative, asymptotically
equivalent estimator is obtained by iterating just once. In the present example
that leads to estimates

\[
\tilde{\gamma}_2 = \begin{bmatrix}
-.178 \ (.165) \\
.492 \ (.133) \\
.331 \ (.099) \\
-.527 \ (.172) \\
.175 \ (.127) \\
-.015 \ (.152)
\end{bmatrix}.
\tag{12.4.16}
\]

These  estimates  are  somewhat  different  from  those  in  (12.4.15).  However, 
given  the  sampling  variability  reflected  in  the  estimated  standard  errors,  the 
differences  in  most  of  the  parameter  estimates  are  not  substantial. 

Under  a  two-standard  error  criterion,  only  two  of  the  coefficients  in 
(12.4.15)  are  significantly  different  from  zero.  As  a  consequence,  one  may 
wish  to  restrict  some  of  the  coefficients  to  zero  and  thereby  further  reduce  the 
parameter  space.  We  will  not  do  so  at  this  stage  but  consider  the  estimates 
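of $\alpha$ and $m$ implied by $\tilde{\gamma}$ given in (12.4.15) (see, however, Problem 12.10):

\[
\tilde{\alpha} = \mathrm{vec}[\tilde{A}_1, \tilde{A}_2] = \begin{bmatrix}
0 \\ 0 \\ 0 \\ .225 \ (.252) \\ 0 \\ 0 \\ 0 \\ .061 \ (.166)
\end{bmatrix},
\quad
\tilde{m} = \mathrm{vec}[\tilde{M}_1, \tilde{M}_2] = \begin{bmatrix}
0 \\ .313 \ (.090) \\ 0 \\ -.750 \ (.274) \\ 0 \\ .140 \ (.141) \\ 0 \\ .160 \ (.233)
\end{bmatrix}.
\tag{12.4.17}
\]

The standard errors are, of course, not affected by adding a few zero elements.
A more elaborate but still simple computation becomes necessary to obtain
the standard errors if $A_0 \neq I_K$ (see Corollary 12.1.2).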
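The two-standard error criterion mentioned above amounts to comparing each estimate in (12.4.15) with twice its estimated standard error. A small sketch of that check follows; the coefficient labels are informal names chosen for this illustration.

```python
import numpy as np

# Estimates and standard errors copied from (12.4.15).
names = ["alpha22_1", "alpha22_2", "m21_1", "m22_1", "m21_2", "m22_2"]
estimates = np.array([0.225, 0.061, 0.313, -0.750, 0.140, 0.160])
std_errors = np.array([0.252, 0.166, 0.090, 0.274, 0.141, 0.233])

# Ratio of each estimate to its standard error.
ratios = estimates / std_errors
significant = [n for n, r in zip(names, ratios) if abs(r) > 2]
print(significant)  # ['m21_1', 'm22_1']
```

Only the two first-lag MA coefficients of the consumption equation exceed the threshold, in line with the statement that two coefficients are significantly different from zero.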


12.5  Forecasting  Estimated  VARMA  Processes 

With respect to forecasting with estimated processes, in principle, the same
arguments apply for VARMA models that have been put forward for pure
VAR models. Suppose that the generation process of a multiple time series of
interest admits a VARMA($p, q$) representation,

\[
y_t - \mu = A_1(y_{t-1} - \mu) + \cdots + A_p(y_{t-p} - \mu) + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q},
\tag{12.5.1}
\]

and denote by $\hat{y}_t(h)$ the $h$-step ahead forecast (with nonzero mean) at origin $t$
given in Section 11.5, based on estimated rather than known coefficients. For
instance, using the pure VAR representation of the process,

\[
\hat{y}_t(h) = \hat{\mu} + \sum_{i=1}^{h-1} \hat{\Pi}_i (\hat{y}_t(h-i) - \hat{\mu}) + \sum_{i=h}^{\infty} \hat{\Pi}_i (y_{t+h-i} - \hat{\mu}).
\tag{12.5.2}
\]

For practical purposes, one would, of course, truncate the infinite sum. For
the moment we will, however, consider the infinite sum. For this predictor,
the forecast error is

\[
y_{t+h} - \hat{y}_t(h) = [y_{t+h} - y_t(h)] + [y_t(h) - \hat{y}_t(h)],
\]

where $y_t(h)$ is the optimal forecast based on known coefficients and the two
terms on the right-hand side are uncorrelated as the first one can be written
in terms of $u_s$ with $s > t$ and the second one contains $y_s$ with $s \le t$, if the
parameter estimators are based on $y_s$ with $s \le t$ only. Thus, the forecast MSE
becomes

\[
\begin{aligned}
\Sigma_{\hat{y}}(h) &= \mathrm{MSE}[y_t(h)] + \mathrm{MSE}[y_t(h) - \hat{y}_t(h)] \\
&= \Sigma_y(h) + E[y_t(h) - \hat{y}_t(h)][y_t(h) - \hat{y}_t(h)]'.
\end{aligned}
\tag{12.5.3}
\]

Formally, this is the same expression that was obtained for finite order VAR
processes and, using the same arguments as in that case, we approximate the
$\mathrm{MSE}[y_t(h) - \hat{y}_t(h)]$ by $\Omega(h)/T$, where

\[
\Omega(h) = E\left[ \frac{\partial y_t(h)}{\partial \eta'} \Sigma_{\tilde{\eta}} \frac{\partial y_t(h)'}{\partial \eta} \right],
\tag{12.5.4}
\]

$\eta$ is the vector of estimated coefficients, and $\Sigma_{\tilde{\eta}}$ is its asymptotic covariance
matrix. If ML estimation is used and

\[
\eta = \begin{bmatrix} \mu \\ \alpha \\ m \end{bmatrix},
\]

where $\alpha = \mathrm{vec}[A_1, \dots, A_p]$ and $m = \mathrm{vec}[M_1, \dots, M_q]$, we have from
Proposition 12.1 and Corollaries 12.1.1 and 12.1.2,

\[
\Sigma_{\tilde{\eta}} = \begin{bmatrix} \Sigma_{\tilde{\mu}} & 0 \\ 0 & \Sigma_{\tilde{\alpha},\tilde{m}} \end{bmatrix}.
\]

Thus,

\[
\frac{\partial y_t(h)}{\partial \eta'} \Sigma_{\tilde{\eta}} \frac{\partial y_t(h)'}{\partial \eta}
= \frac{\partial y_t(h)}{\partial \mu'} \Sigma_{\tilde{\mu}} \frac{\partial y_t(h)'}{\partial \mu}
+ \frac{\partial y_t(h)}{\partial [\alpha', m']} \Sigma_{\tilde{\alpha},\tilde{m}} \frac{\partial y_t(h)'}{\partial \begin{bmatrix} \alpha \\ m \end{bmatrix}}.
\]

Hence, in order to get an expression for $\Omega(h)$ we need the partial derivatives
of $y_t(h)$ with respect to $\mu$, $\alpha$, and $m$. They are given in the next lemma.

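Lemma 12.2

If $y_t$ is a process with stable and invertible VARMA($p, q$) representation
(12.5.1) and pure VAR representation

\[
y_t = \mu + \sum_{i=1}^{\infty} \Pi_i (y_{t-i} - \mu) + u_t,
\]

we have

\[
\frac{\partial y_t(h)}{\partial \mu'} = I_K - \sum_{i=1}^{\infty} \Pi_i + \sum_{i=1}^{h-1} \Pi_i \frac{\partial y_t(h-i)}{\partial \mu'},
\]

which reduces to $I_K - \sum_{i=1}^{\infty} \Pi_i$ for $h = 1$, and

\[
\frac{\partial y_t(h)}{\partial [\alpha', m']}
= \sum_{i=1}^{h-1} [(y_t(h-i) - \mu)' \otimes I_K] \frac{\partial\, \mathrm{vec}(\Pi_i)}{\partial [\alpha', m']}
+ \sum_{i=1}^{h-1} \Pi_i \frac{\partial y_t(h-i)}{\partial [\alpha', m']}
+ \sum_{i=h}^{\infty} [(y_{t+h-i} - \mu)' \otimes I_K] \frac{\partial\, \mathrm{vec}(\Pi_i)}{\partial [\alpha', m']},
\quad \text{for } h = 1, 2, \dots,
\]

with

\[
\frac{\partial\, \mathrm{vec}(\Pi_i)}{\partial [\alpha', m']}
= - \sum_{j=0}^{i-1} \left[ H'(\mathbf{M}')^{i-1-j} \otimes J \mathbf{M}^j \right]
\begin{bmatrix} 0 & I_{Kq} \otimes J' \\ I_{Kp} \otimes J' & 0 \end{bmatrix},
\]

where $H$, $\mathbf{M}$, and $J$ are as defined in Chapter 11, Section 11.3.2, (11.3.13). In
other words, $H$, $\mathbf{M}$, and $J$ are defined so that $-\Pi_i = J \mathbf{M}^i H$. ■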

The proof of this lemma is left as an exercise (see Problem 12.8). The
formulas given in this lemma can be used for recursively computing the partial
derivatives of $y_t(h)$ with respect to the VARMA coefficients for $h = 1, 2, \dots$.

An estimator of $\Omega(h)$ is obtained by replacing all unknown quantities by
their respective estimators and truncating the infinite sum or, equivalently,
replacing $y_t - \mu$ by zero for $t \le 0$. Denoting the resulting estimated partial
derivatives by

\[
\widehat{\frac{\partial y_t(h)}{\partial \mu'}} \quad \text{and} \quad \widehat{\frac{\partial y_t(h)}{\partial [\alpha', m']}},
\]

an estimator for $\Omega(h)$ is

\[
\hat{\Omega}(h) = \frac{1}{T} \sum_{t=1}^{T} \left[
\widehat{\frac{\partial y_t(h)}{\partial \mu'}} \hat{\Sigma}_{\tilde{\mu}} \widehat{\frac{\partial y_t(h)'}{\partial \mu}}
+ \widehat{\frac{\partial y_t(h)}{\partial [\alpha', m']}} \hat{\Sigma}_{\tilde{\alpha},\tilde{m}} \widehat{\frac{\partial y_t(h)'}{\partial \begin{bmatrix} \alpha \\ m \end{bmatrix}}}
\right],
\tag{12.5.5}
\]

where $\hat{\Sigma}_{\tilde{\mu}}$ and $\hat{\Sigma}_{\tilde{\alpha},\tilde{m}}$ are estimators of $\Sigma_{\tilde{\mu}}$ and $\Sigma_{\tilde{\alpha},\tilde{m}}$,
respectively (see Corollaries 12.1.1 and 12.1.2 for the latter matrix). An
estimator of the forecast MSE matrix (12.5.3) is then

\[
\hat{\Sigma}_{\hat{y}}(h) = \hat{\Sigma}_y(h) + \frac{1}{T} \hat{\Omega}(h),
\tag{12.5.6}
\]

where the estimator $\hat{\Sigma}_y(h)$ is again obtained by replacing unknown quantities
by their respective estimators.

With these results in hand, forecast intervals can be set up, under Gaussian
assumptions, just as in the finite order VAR case discussed in Chapters 2 and
3.


12.6  Estimated  Impulse  Responses 

As mentioned in Section 11.7.2, the impulse responses of a VARMA($p, q$)
process are the coefficients of pure MA representations. For instance, if the
process is in standard form, the forecast error impulse responses are

\[
\Phi_i = J \mathbf{A}^i H
\tag{12.6.1}
\]

with $J$, $\mathbf{A}$, and $H$ as defined in Section 11.3.2 (see (11.3.10)). Other quantities
of interest may be the elements of $\Theta_i = \Phi_i P$, where $P$ is a lower triangular
Choleski decomposition of $\Sigma_u$, the white noise covariance matrix. Also forecast
error variance components and accumulated impulse responses may be
of interest. All these quantities are estimated in the usual way from the
estimated coefficients of the process. For example, $\hat{\Phi}_i = J \hat{\mathbf{A}}^i H$, where $\hat{\mathbf{A}}$ is
obtained from $\mathbf{A}$ by replacing the $A_i$ and $M_j$ by estimators $\hat{A}_i$ and $\hat{M}_j$. The
asymptotic distributions of the estimated quantities follow immediately from
Proposition 3.6, which is formulated for the finite order VAR case. The only
modifications that we have to make to accommodate the VARMA($p, q$) case
are to replace $\alpha$ by

\[
\beta := \mathrm{vec}[A_1, \dots, A_p, M_1, \dots, M_q] = \begin{bmatrix} \alpha \\ m \end{bmatrix},
\]

replace $\Sigma_{\hat{\alpha}}$ by $\Sigma_{\hat{\beta}}$, and specify

\[
\begin{aligned}
G_i := \frac{\partial\, \mathrm{vec}(\Phi_i)}{\partial \beta'}
&= (H' \otimes J) \frac{\partial\, \mathrm{vec}(\mathbf{A}^i)}{\partial \beta'} \\
&= (H' \otimes J) \left[ \sum_{m=0}^{i-1} (\mathbf{A}')^{i-1-m} \otimes \mathbf{A}^m \right] \frac{\partial\, \mathrm{vec}(\mathbf{A})}{\partial \beta'} \\
&= \sum_{m=0}^{i-1} H'(\mathbf{A}')^{i-1-m} \otimes J \mathbf{A}^m J'.
\end{aligned}
\tag{12.6.2}
\]

With  these  modifications  of  Proposition  3.6,  the  asymptotic  distributions  of  all 
the  quantities  of  interest  are  available.  Of  course,  all  the  caveats  of  Proposition 
3.6  apply  here  too.  In  principle,  structural  impulse  responses,  as  discussed  in 
Chapter  9,  may  be  of  interest  as  well.  They  are  typically  not  based  on  VARMA 
models,  however. 
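As a small numerical companion to this section, the forecast error impulse responses of a standard form VARMA process can be computed directly from the coefficient matrices. Note that the sketch below does not build the large matrix of (12.6.1); it uses the equivalent textbook recursion with the first response equal to the identity and each later response equal to the MA coefficient of that lag plus the sum of the VAR coefficients times earlier responses, which is an assumption of this illustration rather than the formula stated above. The function name is hypothetical and the coefficient values are the rounded estimates from (12.4.15).

```python
import numpy as np

def ma_coefficients(A, M, n_steps):
    """Return [Phi_0, ..., Phi_n] for y_t = sum_j A_j y_{t-j} + u_t + sum_j M_j u_{t-j}.

    A and M are lists of (K x K) arrays; the recursion is
    Phi_0 = I_K, Phi_i = M_i + sum_{j=1}^{i} A_j Phi_{i-j},
    with M_i = 0 for i > q and A_j = 0 for j > p.
    """
    K = A[0].shape[0] if A else M[0].shape[0]
    Phi = [np.eye(K)]
    for i in range(1, n_steps + 1):
        Phi_i = M[i - 1].copy() if i <= len(M) else np.zeros((K, K))
        for j in range(1, min(i, len(A)) + 1):
            Phi_i = Phi_i + A[j - 1] @ Phi[i - j]
        Phi.append(Phi_i)
    return Phi

# Rounded ML estimates of the income/consumption model (12.3.27).
A_hat = [np.array([[0.0, 0.0], [0.0, 0.225]]), np.array([[0.0, 0.0], [0.0, 0.061]])]
M_hat = [np.array([[0.0, 0.0], [0.313, -0.750]]), np.array([[0.0, 0.0], [0.140, 0.160]])]
Phi_hat = ma_coefficients(A_hat, M_hat, 4)
print(Phi_hat[1])
print(Phi_hat[2])
```

Orthogonalized responses would additionally require an estimate of the white noise covariance matrix, which is not reported as a matrix in this section.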


12.7  Exercises 

Problem 12.1

Are the operators

\[
\begin{bmatrix} 1 - 0.5L & 0.3L \\ 0 & 1 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} 1 - 0.2L & 1.3L - 0.44L^2 \\ 0.5L & 1 + 0.2L \end{bmatrix}
\]

left-coprime? (Hint: Show that the first operator is a common factor.)

Problem 12.2

Write  the  bivariate  process 


'  1 - 0L  0 

'1-0 L  O' 

L2  1  —  0  L 

Ut  — 

0  L  1 

in  final  equations  form  and  in  echelon  form. 

Problem  12.3 

Show  that  (12.4.12)  is  an  echelon  form  representation. 

Problem 12.4

Derive the likelihood function for a general Gaussian VARMA($p, q$) model
given fixed but not necessarily zero initial vectors $y_{-p+1}, \dots, y_0$. Do not
assume that $u_{-q+1} = \cdots = u_0 = 0$!

Problem  12.5 

Identify the rules from Appendix A that are used in deriving the partial
derivatives in (12.3.8).

Problem 12.6

Suppose that $\ln |\tilde{\Sigma}_u(\mu, \gamma)|$ given in (12.3.11) is to be minimized with respect
to $\gamma$. Show that the resulting normal equations are

\[
\frac{\partial \ln |\tilde{\Sigma}_u(\mu, \gamma)|}{\partial \gamma'} = \frac{2}{T} \sum_{t=1}^{T} u_t' \tilde{\Sigma}_u^{-1} \frac{\partial u_t}{\partial \gamma'} = 0.
\]

Thus,  the  normal  equations  are  equivalent  to  those  obtained  from  the  log- 
likelihood  function. 

Problem  12.7 

Consider the bivariate VARMA(2,1) process given in (12.4.12) and set up
the matrices $H_\alpha$ and $H_m$ according to their general form given in Corollary
12.1.2. Show that $H_\alpha R$ and $H_m R$ are identical to the matrices specified in
(12.4.13) and (12.4.14), respectively.

Problem  12.8 

Prove Lemma 12.2. (Hint: Use Rule (8) of Appendix A.13.)

Problem  12.9 

Derive the asymptotic covariance matrices of the impulse responses and forecast
error variance components obtained from an estimated VARMA process.
(Hint: Use the suggestion given in Section 12.6.)

Problem  12.10 

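Consider the income/consumption example of Section 12.3.5 and determine
preliminary and full ML estimates for the parameters of the model

\[
\begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix}
=
\begin{bmatrix} 0 & 0 \\ 0 & \alpha_{22,2} \end{bmatrix}
\begin{bmatrix} y_{1,t-2} \\ y_{2,t-2} \end{bmatrix}
+
\begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ m_{21,1} & m_{22,1} \end{bmatrix}
\begin{bmatrix} u_{1,t-1} \\ u_{2,t-1} \end{bmatrix}
+
\begin{bmatrix} 0 & 0 \\ m_{21,2} & 0 \end{bmatrix}
\begin{bmatrix} u_{1,t-2} \\ u_{2,t-2} \end{bmatrix}.
\]


13  Specification and Checking the Adequacy of VARMA Models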


13.1  Introduction 

A great number of strategies has been suggested for specifying VARMA models.
There is not a single one that has become a standard like the Box & Jenkins
(1976) approach in the univariate case. None of the multivariate procedures is
in widespread use for modelling moderate or high-dimensional economic time
series. Some are mainly based on a subjective assessment of certain characteristics
of a process such as the autocorrelations and partial autocorrelations.
A decision on specific orders and constraints on the coefficient matrices is
then based on these quantities. Other methods rely on a mixture of statistical
testing, use of model selection criteria and personal judgement of the
analyst. Again other procedures are based predominantly on statistical model
selection criteria and, in principle, they could be performed automatically by
a computer. Automatic procedures have the advantage that their statistical
properties can possibly be derived rigorously. In actual applications, some kind
of mixture of different approaches is often used. In other words, the expertise
and prior knowledge of an analyst will usually not be abandoned in favor of
purely statistical procedures. Models suggested by different types of criteria
and procedures will be judged and evaluated by an expert before one or more
candidates are put to a specific use such as forecasting. The large amount
of information in a number of moderately long time series usually makes it
necessary to condense the information considerably before essential features
of a system become visible.

In the following, we will outline procedures for specifying the final equations
form and the echelon form of a VARMA process. We do not claim that
these procedures are superior to other approaches. They are just meant to
illustrate what is involved in the specification of VARMA models. The
specification strategies for both forms could be turned into automatic algorithms.
On the other hand, they also leave room for human intervention if desired.
In Section 13.4, some references for other specification strategies are given
and model checking is discussed briefly in Section 13.5. A critique of VARMA
modelling is given in Section 13.6.


13.2  Specification  of  the  Final  Equations  Form 

13.2.1  A  Specification  Procedure 

Historically,  procedures  for  specifying  final  equations  VARMA  representations 
were  among  the  earlier  strategies  for  modelling  systems  of  economic  time  series 
(see,  for  example,  Zellner  &  Palm  (1974),  Wallis  (1977)).  The  objective  is  to 
find the orders $p$ and $q$ of the representation

\[
\alpha(L) y_t = M(L) u_t,
\tag{13.2.1}
\]

where

\[
\alpha(L) := 1 - \alpha_1 L - \cdots - \alpha_p L^p
\]

is a $(1 \times 1)$ scalar operator,

\[
M(L) := I_K + M_1 L + \cdots + M_q L^q
\]

is a $(K \times K)$ matrix operator and it is assumed that the process mean has
been removed in a previous step of the analysis.

If a $K$-dimensional system $y_t = (y_{1t}, \dots, y_{Kt})'$ has a VARMA representation
of the form (13.2.1), then it follows that each component series has a
univariate ARMA representation

\[
\alpha(L) y_{kt} = m_k(L) v_{kt}, \quad k = 1, \dots, K,
\]

where $m_k(L)$ is an operator of degree at most $q$ because the $k$-th row of
$M(L) u_t$ is

\[
m_{k1}(L) u_{1t} + \cdots + m_{kK}(L) u_{Kt}.
\]

In other words, it is a sum of MA($q$) processes which is known to have an
MA representation of degree at most $q$ (see Proposition 11.1). Thus, each
component series of $y_t$ has the same AR operator and an MA operator of
degree at most $q$. In general, at least one of the component series will have
MA degree $q$ because a reduction of the MA order of all component series
requires a very special set of parameters which is not regarded as likely in
practice. This fact is used in specifying the final form VARMA representation
by first determining univariate component models and then putting them
together in a joint model. Specifically, the following specification strategy is
used.


STAGE I: Specify univariate models

$$\alpha_k(L) y_{kt} = m_k(L) v_{kt}$$

for the components of $y_t$. Here

$$\alpha_k(L) := 1 - \alpha_{k1} L - \cdots - \alpha_{kp_k} L^{p_k}$$

is of order $p_k$,

$$m_k(L) := 1 + m_{k1} L + \cdots + m_{kq_k} L^{q_k}$$

is of order $q_k$, and $v_{kt}$ is a univariate white noise process. ■

The Box & Jenkins (1976) strategy for specifying univariate ARMA models
may be used at this stage. Alternatively, some automatic procedure or
criterion  such  as  the  one  proposed  by  Hannan  &  Rissanen  (1982)  or  Poskitt 
(1987)  may  be  applied. 

STAGE II: Determine a common AR operator $\alpha(L)$ for all component
processes, specify the corresponding MA orders and choose the degree q of the
joint MA operator as the maximum of the individual MA degrees obtained in
this way. ■

At this stage, a common AR operator may, for example, be obtained as
the product of the individual operators, that is,

$$\alpha(L) = \alpha_1(L) \cdots \alpha_K(L).$$

In this case, the k-th component process is multiplied by

$$\prod_{i=1,\, i \neq k}^{K} \alpha_i(L)$$

and $\alpha(L)$ has degree $p = \sum_{i=1}^{K} p_i$, while the MA operator

$$\bar{m}_k(L) = m_k(L) \prod_{i=1,\, i \neq k}^{K} \alpha_i(L)$$

has degree

$$q_k + \sum_{i=1,\, i \neq k}^{K} p_i.$$

The joint MA operator of the VARMA representation (13.2.1) is then assumed
to have degree

$$\max_k \Big[ q_k + \sum_{i=1,\, i \neq k}^{K} p_i \Big]. \tag{13.2.2}$$

Of course, the $\alpha_k(L)$, k = 1, ..., K, may have common factors. In that case, a
joint AR operator $\alpha(L)$ with degree much lower than $\sum_{i=1}^{K} p_i$ may be possible.
Correspondingly, the joint MA operator may have degree lower than (13.2.2).
Suppose, for instance, that K = 3 and

$$\alpha_1(L) = 1 - \alpha_{11} L, \qquad m_1(L) = 1 + m_{11} L,$$

$$\alpha_2(L) = 1 - \alpha_{21} L - \alpha_{22} L^2, \qquad m_2(L) = 1 + m_{21} L,$$

$$\alpha_3(L) = 1 - \alpha_{31} L, \qquad m_3(L) = 1.$$

Now a joint AR operator is

$$\alpha(L) = \alpha_1(L)\,\alpha_2(L)\,\alpha_3(L),$$

which has degree 4. However, if $\alpha_2(L)$ can be factored as

$$\alpha_2(L) = (1 - \alpha_{11} L)(1 - \alpha_{31} L) = \alpha_1(L)\,\alpha_3(L),$$

then a common AR operator $\alpha(L) = \alpha_2(L)$ may be chosen and we get
univariate models

$$\alpha(L) y_{1t} = \alpha_3(L) m_1(L) v_{1t} \quad [\text{ARMA}(2,2)],$$

$$\alpha(L) y_{2t} = m_2(L) v_{2t} \quad [\text{ARMA}(2,1)], \tag{13.2.3}$$

$$\alpha(L) y_{3t} = \alpha_1(L) m_3(L) v_{3t} \quad [\text{ARMA}(2,1)].$$

The maximum of the individual MA degrees is chosen as the joint MA degree,
that is, q = 2 and, of course,

$$p = \mathrm{degree}(\alpha(L)) = 2.$$

A problem that should be noticed from this discussion and example is that the
degrees p and q determined in this way may be quite large. It is conceivable
that $p = \sum_{i=1}^{K} p_i$ is the smallest possible AR order for the final equations form
representation and the corresponding MA degree may be quite substantial too.
This, clearly, can be a disadvantage as unduly many parameters can cause
trouble in a final estimation algorithm and may lead to imprecise forecasts
and impulse responses.
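
The degree bookkeeping of Stage II can be sketched in a few lines of Python. The helper name `product_form_degrees` is hypothetical; the sketch assumes that the common AR operator is taken as the plain product of the univariate AR operators, so it returns the upper bound for p together with the MA degree from (13.2.2). It does not detect common factors.

```python
def product_form_degrees(ar_orders, ma_orders):
    """Return (p, q) when alpha(L) is the product of all univariate AR operators.

    ar_orders[k] = p_k and ma_orders[k] = q_k are the univariate ARMA orders.
    """
    p = sum(ar_orders)
    # The k-th model is multiplied by every AR operator except its own,
    # so its MA degree becomes q_k + sum of the other p_i.
    q = max(q_k + p - p_k for p_k, q_k in zip(ar_orders, ma_orders))
    return p, q


# Three-variable example from the text: ARMA orders (1,1), (2,1), (1,0).
print(product_form_degrees([1, 2, 1], [1, 1, 0]))  # (4, 4)
```

When $\alpha_2(L) = \alpha_1(L)\alpha_3(L)$, as in the example above, the degrees p = 2 and q = 2 suffice, which illustrates how much can be gained from common factors.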

Often  it  may  be  possible  to  impose  restrictions  on  the  AR  and  MA  opera¬ 
tors  in  (13.2.1).  This  modification  may  either  be  done  in  a  third  stage  of  the 
procedure  or  it  may  be  incorporated  in  Stages  I  and/or  II,  depending  on  the 
type  of  information  available.  Restrictions  may  be  obtained  with  the  help  of 
statistical  tools  such  as  testing  the  significance  of  single  coefficients  or  groups 
of  parameters.  Alternatively,  restrictions  may  be  implied  by  subject  matter 
theory.  Zellner  &  Palm  (1974)  give  a  detailed  example  where  both  types  of 
restrictions  are  used. 

Perhaps  because  of  the  potentially  great  number  of  parameters,  final  form 
modelling  has  not  become  very  popular.  It  can  only  be  recommended  if  it 
results in a reasonably parsimonious parameterization.


13.2.2 An Example

For illustrative purposes, we consider a bivariate system consisting of first
differences of logarithms of income ($y_1$) and consumption ($y_2$). We use again
the data from File E1 up to the fourth quarter of 1978. If the 3-dimensional
system involving investment in addition really were generated by a VAR(2)
process, as assumed in Chapters 3 and 4, it is quite possible that the
subprocess consisting of income and consumption only has a mixed VARMA
generation process with nontrivial MA part (see Section 11.6.1). Moreover,
the marginal univariate processes for $y_1$ and $y_2$ may be of a mixed ARMA
type. However, we found that the subset AR(3) models (with standard errors
in parentheses)

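$$(1 - \underset{(.113)}{.245}\, L^3)\, y_{1t} = \underset{(.003)}{.015} + v_{1t} \tag{13.2.4}$$

and

$$(1 - \underset{(.111)}{.309}\, L^2 - \underset{(.111)}{.187}\, L^3)\, y_{2t} = \underset{(.004)}{.010} + v_{2t} \tag{13.2.5}$$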

fit  the  data  quite  well.  For  illustrative  purposes,  we  will  therefore  proceed 
from  these  models.  The  reader  may  try  to  find  better  models  and  repeat  the 
analysis  with  them. 

Generally, a (1 × 1) scalar operator

$$\gamma(L) = 1 - \gamma_1 L - \cdots - \gamma_p L^p$$

of degree p can be factored in p components,

$$\gamma(L) = (1 - \lambda_1 L) \cdots (1 - \lambda_p L),$$

where $\lambda_1, \dots, \lambda_p$ are the reciprocals of the roots of $\gamma(z)$. Thus, the two AR
operators from (13.2.4) and (13.2.5) can be factored as

$$\alpha_1(L) = 1 - .245 L^3 = (1 - .626 L)(1 + (.313 + .542i)L)(1 + (.313 - .542i)L) \tag{13.2.6}$$

and

$$\alpha_2(L) = 1 - .309 L^2 - .187 L^3 = (1 - .747 L)(1 + (.374 + .332i)L)(1 + (.374 - .332i)L), \tag{13.2.7}$$

where $i = \sqrt{-1}$ denotes the imaginary unit. None of the
factors in (13.2.6) is very close to any of the factors in (13.2.7). Thus, models
with common AR operator may be of the form

$$\alpha_1(L)\alpha_2(L) y_{1t} = \alpha_2(L) v_{1t}$$

and

$$\alpha_1(L)\alpha_2(L) y_{2t} = \alpha_1(L) v_{2t}.$$

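With the arguments of the previous subsection, the resulting bivariate final
equations model is a VARMA(6, 3) process,

$$\alpha_1(L)\alpha_2(L) \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = (I_2 + M_1 L + M_2 L^2 + M_3 L^3) \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}. \tag{13.2.8}$$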
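
The factorizations (13.2.6) and (13.2.7) can be checked numerically. The following sketch uses NumPy's `numpy.roots`; the helper name `inverse_roots` is an assumption made for illustration. It relies on the fact that the $\lambda_i$ in $\gamma(L) = (1 - \lambda_1 L) \cdots (1 - \lambda_p L)$ solve $\lambda^p - \gamma_1 \lambda^{p-1} - \cdots - \gamma_p = 0$.

```python
import numpy as np


def inverse_roots(coeffs):
    """Reciprocal roots lambda_i of gamma(L) = 1 - g_1 L - ... - g_p L^p.

    coeffs = [g_1, ..., g_p]. np.roots expects the coefficients of
    lambda^p - g_1 lambda^(p-1) - ... - g_p, highest power first.
    """
    poly = np.concatenate(([1.0], -np.asarray(coeffs, dtype=float)))
    return np.roots(poly)


lam1 = inverse_roots([0.0, 0.0, 0.245])    # alpha_1(L) = 1 - .245 L^3
lam2 = inverse_roots([0.0, 0.309, 0.187])  # alpha_2(L) = 1 - .309 L^2 - .187 L^3
print(np.round(lam1, 3))
print(np.round(lam2, 3))
```

Each factor in the text is $(1 - \lambda_i L)$, so a reciprocal root with negative real part appears there with a plus sign. Small differences in the third decimal place relative to the printed factors are to be expected because the AR coefficients themselves are rounded.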


Obviously,  this  model  involves  very  many  parameters  and  is  therefore  unattrac¬ 
tive.  In  fact,  such  a  heavily  parameterized  model  may  cause  numerical  prob¬ 
lems  when  full  maximum  likelihood  estimation  is  attempted.  It  is  possible,  if 
not  likely,  that  some  parameters  turn  out  to  be  insignificant  and  could  be  set 
to  zero.  However,  the  significance  of  parameters  is  commonly  judged  on  the 
basis  of  their  standard  errors  or  t-ratios.  These  quantities  become  available  in 
the  ML  estimation  round  which,  as  we  have  argued,  may  be  problematic  in 
the  present  case. 

Given the estimation uncertainty, one may argue that the real factors in
the operators $\alpha_1(L)$ and $\alpha_2(L)$ may be identical. Proceeding under that
assumption results in a VARMA(5, 2) final equations form. Such a model is
more  parsimonious  and  has  therefore  more  appeal  than  (13.2.8).  Still  it  in¬ 
volves  a  considerable  number  of  parameters.  This  example  illustrates  why 
final  equations  modelling,  although  relatively  simple,  does  not  enjoy  much 
popularity.  For  higher-dimensional  models,  the  problem  of  heavy  parameteri¬ 
zation  becomes  even  more  severe  because  the  number  of  parameters  is  likely 
to  increase  rapidly  with  the  dimension  of  the  system.  We  will  now  present 
procedures  for  specifying  echelon  forms. 


13.3  Specification  of  Echelon  Forms 

In  specifying  an  echelon  VARMA  representation,  the  objective  is  to  find  the 
Kronecker indices and possibly impose some further restrictions on the
parameters. For a K-dimensional process, there are K Kronecker indices. Different
strategies  have  been  proposed  for  their  specification.  We  will  discuss  some  of 
them  in  the  following.  Once  the  Kronecker  indices  are  determined,  further 
restrictions  may  be  imposed,  for  instance,  on  the  basis  of  significance  tests  for 
individual  coefficients  or  groups  of  parameters. 

In  the  first  subsection  below,  we  will  discuss  a  procedure  for  specifying  the 
Kronecker  indices  which  is  usually  not  feasible  in  practice.  It  is  nevertheless 
useful  to  study  that  procedure  because  the  feasible  strategies  considered  in 
Subsections  13.3.2-13.3.4  may  be  regarded  as  approximations  or  short-cuts  of 
that  procedure  with  similar  asymptotic  properties.  In  Subsection  13.3.2,  we 
present  a  procedure  which  is  easy  to  carry  out  for  systems  with  small  Kro¬ 
necker  indices  and  low  dimensions.  It  is  quite  costly  for  higher-dimensional 
systems, though. For such systems a specification strategy inspired by
Hannan & Kavalieris (1984) or a procedure due to Poskitt (1992) may be more
appealing. These approaches are considered in Subsections 13.3.3 and 13.3.4,
respectively.  The  material  discussed  in  this  section  is  covered  in  more  depth 
and  more  rigorously  in  Hannan  &  Deistler  (1988,  Chapters  5,  6,  7)  and  Poskitt 
(1992). 


13.3.1  A  Procedure  for  Small  Systems 


If it is known that the generation process of a given K-dimensional multiple
time series admits an echelon VARMA representation with Kronecker indices
$p_k \le p_{\max}$, k = 1, ..., K, where $p_{\max}$ is a prespecified number, then, in theory,
it is possible to evaluate the maximum log-likelihood for all sets of Kronecker
indices $\mathbf{p} = (p_1, \dots, p_K)$ with $p_k \le p_{\max}$ and choose the set $\hat{\mathbf{p}}$ that optimizes
a specific criterion. This approach is completely analogous to the specification
of  the  VAR  order  in  the  finite  order  VAR  case  considered  in  Section  4.3.  In 
that  section,  we  have  discussed  the  possibility  to  consistently  estimate  the 
VAR  order  with  such  an  approach.  It  turns  out  that  a  similar  result  can  be 
obtained  for  the  present  more  general  VARMA  case. 

Before  we  give  further  details,  it  may  be  worth  emphasizing,  however, 
that  in  the  VARMA  case,  such  a  specification  strategy  is  generally  not  fea¬ 
sible  in  practice  because  the  maximization  of  the  log-likelihood  is  usually 
quite costly and, for systems with moderate or high dimensions, an
enormous number of likelihood maximizations would be required. For instance,
for a five-dimensional system, evaluating the maximum of the log-likelihood
for all vectors of Kronecker indices $\mathbf{p} = (p_1, \dots, p_5)$ with $p_k \le 8$ requires
$9^5 = 59{,}049$ likelihood maximizations. Despite this practical problem, we
discuss the theoretical properties of this procedure to provide a basis for the
following  subsections. 
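
The size of the search space is easy to verify by brute force. The sketch below simply enumerates all candidate index vectors with `itertools.product`; the function name `kronecker_index_sets` is a hypothetical helper, not part of any library.

```python
from itertools import product


def kronecker_index_sets(K, p_max):
    """Iterate over every vector (p_1, ..., p_K) with 0 <= p_k <= p_max."""
    return product(range(p_max + 1), repeat=K)


# Five-dimensional system with all indices bounded by 8.
n_models = sum(1 for _ in kronecker_index_sets(5, 8))
print(n_models)  # 59049 = 9**5
```

Each index can take the $p_{\max} + 1$ values $0, \dots, p_{\max}$, so the count is $(p_{\max} + 1)^K$ and grows exponentially in the dimension K.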

Let us denote by $\tilde{\Sigma}_u(\mathbf{p})$ the ML estimator of the white noise covariance
matrix $\Sigma_u$ obtained for a set of Kronecker indices $\mathbf{p}$. Furthermore, let

$$\mathrm{Cr}(\mathbf{p}) := \ln|\tilde{\Sigma}_u(\mathbf{p})| + c_T\, d(\mathbf{p})/T \tag{13.3.1}$$

be a criterion to be minimized over all sets of Kronecker indices $\mathbf{p} = (p_1, \dots, p_K)$,
$p_k \le p_{\max}$. Here $d(\mathbf{p})$ is the number of freely varying parameters
in the $\mathrm{ARMA}_E(\mathbf{p})$ form. For example, for a bivariate system with Kronecker
indices $\mathbf{p} = (p_1, p_2) = (1, 0)$, the $\mathrm{ARMA}_E(1, 0)$ form is

Thus, d(1, 0) = 4. In (13.3.1), $c_T$ is a sequence indexed by the sample size T.

In general, if models are included in the search procedure for which all
Kronecker indices exceed the true ones, the estimation of unidentified models
is required for which cancellation of the VAR and MA operators is possible.
This over-specification is not necessarily a problem for evaluating the
criterion in (13.3.1) because we only need the maximum log-likelihood or rather
$\ln|\tilde{\Sigma}_u(\mathbf{p})|$ in that criterion. That quantity can be determined even if the
corresponding VARMA coefficients are meaningless. The coefficients cannot and
should not be interpreted, however.
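
The parameter count $d(\mathbf{p})$ entering (13.3.1) can be sketched as follows. The sketch rests on an assumption that is not restated in this section: in the echelon form of Chapter 12, equation k is taken to contain $p_{ki}$ free AR coefficients for variable i, with $p_{ki} = \min(p_k + 1, p_i)$ for k > i and $p_{ki} = \min(p_k, p_i)$ for k ≤ i, plus $K p_k$ free MA coefficients. The function name `echelon_param_count` is hypothetical. Under this assumption the count reproduces d(1, 0) = 4.

```python
def echelon_param_count(p):
    """Number of free VARMA coefficients d(p) for Kronecker indices p.

    Assumes the echelon-form restrictions stated in the lead-in
    (they are not restated in this section).
    """
    K = len(p)
    d = 0
    for k in range(K):
        for i in range(K):
            # Free AR coefficients of variable i in equation k.
            d += min(p[k] + 1, p[i]) if k > i else min(p[k], p[i])
        # Equation k has K * p_k free MA coefficients.
        d += K * p[k]
    return d


print(echelon_param_count((1, 0)))  # 4, as stated in the text
```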

Note that the criterion (13.3.1) is very similar to that considered in
Proposition 4.2 of Chapter 4. In that proposition, the consistency or inconsistency of
a criterion is seen to depend on the choice of the sequence $c_T$. Hannan (1981)
and Hannan & Deistler (1988, Chapter 5, Section 5) showed that a criterion
such as the one in (13.3.1) provides a consistent estimator of the true set of
Kronecker indices if $c_T$ is a nondecreasing function of T satisfying

$$c_T \to \infty \quad \text{and} \quad c_T/T \to 0 \quad \text{as } T \to \infty, \tag{13.3.2}$$

and the true data generation process satisfies some weak conditions. If, in
addition,

$$c_T / (2 \ln \ln T) > 1 \tag{13.3.3}$$

eventually as $T \to \infty$, the procedure provides a strongly consistent estimator
of the true Kronecker indices. The conditions for the VARMA process are,
for instance, satisfied if the white noise process $u_t$ is identically distributed
standard white noise (see Definition 3.1) and the true data generation process
admits a stable and invertible $\mathrm{ARMA}_E$ representation with Kronecker indices
not greater than $p_{\max}$. This result extends Proposition 4.2 to the VARMA
case.

Implications of this result are that the Schwarz criterion with $c_T = \ln T$,

$$\mathrm{SC}(\mathbf{p}) := \ln|\tilde{\Sigma}_u(\mathbf{p})| + d(\mathbf{p}) \ln T / T, \tag{13.3.4}$$

is strongly consistent and that the Hannan-Quinn criterion, using the
borderline penalty term $c_T = 2 \ln \ln T$,

$$\mathrm{HQ}(\mathbf{p}) := \ln|\tilde{\Sigma}_u(\mathbf{p})| + 2 d(\mathbf{p}) \ln \ln T / T, \tag{13.3.5}$$

is consistent. Hannan & Deistler (1988) also showed that

$$\mathrm{AIC}(\mathbf{p}) := \ln|\tilde{\Sigma}_u(\mathbf{p})| + 2 d(\mathbf{p}) / T \tag{13.3.6}$$

with $c_T = 2$ is not a consistent criterion. Again these results are similar to
those for the finite order VAR case.
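
The three criteria differ only in the penalty sequence $c_T$, which the following minimal sketch makes explicit. The function name `selection_criteria` and its arguments are assumptions for illustration: `sigma_u` stands for any estimate of the white noise covariance matrix, `d` for the number of free parameters, and `T` for the sample size.

```python
import numpy as np


def selection_criteria(sigma_u, d, T):
    """Return AIC, HQ and SC, each of the form ln|Sigma_u| + c_T * d / T."""
    sign, logdet = np.linalg.slogdet(np.asarray(sigma_u, dtype=float))
    if sign <= 0:
        raise ValueError("covariance estimate must be positive definite")
    # Penalty sequences c_T from (13.3.4)-(13.3.6).
    penalties = {"AIC": 2.0, "HQ": 2.0 * np.log(np.log(T)), "SC": np.log(T)}
    return {name: logdet + c_T * d / T for name, c_T in penalties.items()}


# Toy input: identity covariance (log-determinant 0), d = 4, T = 100.
crit = selection_criteria(np.eye(2), d=4, T=100)
print(crit)
```

For T = 100 the penalties are ordered AIC < HQ < SC, so SC tends to select the most parsimonious model among the three.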

As  in  that  case,  it  is  worth  emphasizing  that  these  results  do  not  necessarily 
imply  the  inferiority  of  AIC  or  HQ.  In  small  samples,  these  criteria  may  be 
preferable.  They  may,  in  fact,  provide  superior  models  for  a  specific  analysis  of 
interest.  Also,  in  practice,  the  actual  data  generation  mechanism  will  usually 
not  really  admit  a  VARMA  representation.  Recall  that  the  best  we  can  hope 
for is that our model is a good approximation to the true data generation
process. In that case, the relevance of the consistency property is of course
doubtful. 

Again,  the  specification  strategy  presented  in  the  foregoing  is  not  likely  to 
have  much  practical  appeal  as  it  is  computationally  too  burdensome.  In  the 
next  subsections,  more  practical  modifications  are  discussed. 


13.3.2  A  Full  Search  Procedure  Based  on  Linear  Least  Squares 
Computations 

The  Procedure 

A major obstacle for using the procedure described in the previous
subsection is the requirement to maximize the log-likelihood many times. This
maximization is costly because for mixed VARMA models the log-likelihood
function is nonlinear in the parameters and iterative optimization algorithms
have to be employed. Because we just need an estimator of $\Sigma_u$ for the
evaluation of model selection criteria such as (13.3.1), an obvious modification of
the  procedure  would  use  an  estimator  that  avoids  the  nonlinear  optimization 
problem.  Such  an  estimator  may  be  obtained  from  the  preliminary  estimation 
procedure  described  in  Chapter  12,  Section  12.3.4.  Therefore,  a  specification 
of  the  Kronecker  indices  may  proceed  in  the  following  stages. 

STAGE I: Fit a long VAR process of order n, say, to the data and obtain
the estimated residual vectors $\hat{u}_t(n)$, t = 1, ..., T. ■

The  choice  of  n  could  be  based  on  an  order  selection  criterion  such  as  AIC. 
In  any  case,  n  has  to  be  greater  than  the  largest  Kronecker  index  pmax  to  be 
considered  in  the  next  stage  of  the  procedure. 
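
Stage I amounts to one multivariate least squares regression. The sketch below is a minimal illustration under stated assumptions: the function name `long_var_residuals` is hypothetical, an intercept is included instead of removing the mean beforehand, and the first n observations are used as presample values, so T − n residual vectors are returned.

```python
import numpy as np


def long_var_residuals(y, n):
    """LS residuals of a VAR(n) with intercept; y is a (T, K) array."""
    T, K = y.shape
    # Regressor row for time t: [1, y_{t-1}', ..., y_{t-n}'].
    lags = [y[n - j:T - j] for j in range(1, n + 1)]
    Z = np.hstack([np.ones((T - n, 1))] + lags)
    Y = y[n:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Y - Z @ B


# Toy data: 200 observations of bivariate Gaussian noise.
rng = np.random.default_rng(0)
y = rng.standard_normal((200, 2))
resid = long_var_residuals(y, n=8)
print(resid.shape)  # (192, 2)
```

The residuals would then replace the unobserved white noise terms in the linear regressions of the next stage.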

STAGE II: Using the residuals $\hat{u}_t(n)$ from Stage I, compute the preliminary
estimator of Section 12.3.4 for all sets of Kronecker indices $\mathbf{p}$ with $p_k \le p_{\max}$,
where the latter number is a prespecified upper bound for the Kronecker
indices. Determine all corresponding estimators $\tilde{\Sigma}_u(\mathbf{p})$ based on the residuals
of the preliminary estimations (see (12.3.24)). (Here we suppress the order n
from the first stage for notational convenience because the same n is used for
all $\tilde{\Sigma}_u(\mathbf{p})$ at this stage.) Choose the estimator $\hat{\mathbf{p}}$ which minimizes a
prespecified criterion of the form (13.3.1). ■

The  choice  of  the  criterion  Cr(p)  is  left  to  the  researcher.  SC,  HQ,  and 
AIC  from  (13.3.4)-(13.3.6)  are  possible  candidates.  Stage  II  could  be  iterated 
by  using  the  residuals  from  a  previous  run  through  Stage  II  instead  of  the 
residuals from Stage I. Once an estimate $\hat{\mathbf{p}}$ of the Kronecker indices is
determined, the ML estimates conditional on $\hat{\mathbf{p}}$ may be computed in a final stage.

STAGE III: Estimate the echelon form VARMA model with Kronecker
indices $\hat{\mathbf{p}}$ by maximizing the Gaussian log-likelihood function or by just one step
of the scoring algorithm (see Section 12.3.2).


Hannan  &  Deistler  (1988,  Chapter  6)  discussed  conditions  under  which 
this  procedure  provides  a  consistent  estimator  of  the  Kronecker  indices  and 
VARMA  parameter  estimators  that  have  the  same  asymptotic  properties  as 
the  estimators  obtained  for  given,  known,  true  Kronecker  indices  (see  Proposi¬ 
tion  12.1).  In  addition  to  our  usual  assumptions  for  the  VARMA  process  such 
as  stability  and  invertibility,  the  required  assumptions  relate  to  the  criteria 
for  choosing  the  VAR  order  in  Stage  I  and  the  Kronecker  indices  in  Stage 
II.  These  conditions  are  asymptotic  conditions  that  leave  some  room  for  the 
actual  choice  in  small  samples.  Any  criterion  from  (13.3.4)-(13.3.6)  may  be  a 
reasonable  choice  in  practice. 

The procedure still involves extensive computations, unless the dimension
K of the underlying multiple time series and $p_{\max}$ are small. For example,
for a five-dimensional system with $p_{\max} = 8$ we still have to perform
$9^5 = 59{,}049$ estimations in order to compare all feasible models. Although these
estimations  involve  linear  least  squares  computations  only,  the  computational 
costs  may  be  substantial.  Therefore,  we  outline  two  less  costly  procedures 
in  the  following  subsections.  For  small  systems,  the  present  procedure  is  a 
reasonable  choice.  We  give  an  example  next. 


An  Example 


We consider again the income/consumption example from Section 13.2.2. In
the first stage of our procedure, we fit a VAR(8) model (n = 8) and we use
the residuals, $\hat{u}_t(8)$, at the next stage. The choice of n = 8 is to some extent
arbitrary. We have chosen a fairly high order to gain flexibility for the
Kronecker indices considered at Stage II. Recall that n must exceed all Kronecker
indices to be considered subsequently. Using the procedure described as Stage
II, we have estimated models with Kronecker indices $p_k \le p_{\max} = 4$ and we
have determined the corresponding values of the criteria AIC and HQ. They
are given in Tables 13.1 and 13.2, respectively. Both criteria reach their
minimum for $\hat{\mathbf{p}} = (\hat{p}_1, \hat{p}_2) = (0, 2)$. The $\mathrm{ARMA}_E(0, 2)$ form is precisely the model
estimated in Chapter 12, Section 12.3.5. Replacing the parameters by their
ML estimates with estimated standard errors in parentheses, we have

Obviously, some of the parameters are quite small compared to their
estimated standard errors. In such a situation, one may want to impose further
zero constraints on the parameters. Because the estimates of $\alpha_{22,1}$, $\alpha_{22,2}$, and $m_{22,2}$ have the
smallest t-ratios in absolute terms, we restrict these estimates to zero and
reestimate the model. The resulting system obtained by ML estimation is

Now all parameters are significant under a two-standard error criterion.


Table 13.1. AIC values of $\mathrm{ARMA}_E(p_1, p_2)$ models for the
income/consumption data


Pi 

P2 

0 

1 

2 

3 

4 

0 

-16.83 

-18.41 

-18.30 

-18.25 

-18.15 

1

2

3

4

-18.47 

-18.38

-18.20

-18.05 

*  Minimum 


Table 13.2. HQ values of $\mathrm{ARMA}_E(p_1, p_2)$ models for the
income/consumption data


Pi

-18.31


13.3.3 Hannan-Kavalieris Procedure

A full search procedure for the optimal Kronecker indices, as in Stage II of the
previous subsection, involves a substantial amount of computation work if the
dimension K of the time series considered is large or if the upper bound $p_{\max}$
for the Kronecker indices is high. For instance, if monthly data are considered
and lags of at least one year are deemed necessary, $p_{\max} \ge 12$ is required.
Even if the system involves just three variables (K = 3) the number of
models to be compared is vast, namely $13^3 = 2197$. Therefore, shortcuts for Stage
II of the previous subsection were proposed. The first one we present here is
inspired  by  discussions  of  Hannan  &  Kavalieris  (1984).  Therefore,  we  call  it 
the  Hannan-Kavalieris  procedure  although  these  authors  proposed  a  more  so¬ 
phisticated  approach.  In  particular,  they  discussed  a  number  of  computational 
simplifications  (see  also  Hannan  &  Deistler  (1988,  Chapter  6)).  The  following 
modification  of  Stage  II  may  be  worth  trying. 

STAGE II (HK): Based on the covariance estimators obtained from the
preliminary estimation procedure of Section 12.3.4, find the Kronecker indices,
say $\mathbf{p}^{(1)} = p^{(1)}(1, \dots, 1)$, that minimize a prespecified criterion of the type
$\mathrm{Cr}(\mathbf{p})$ in (13.3.1) over $\mathbf{p} = p\,(1, \dots, 1)$, $p = 0, \dots, p_{\max}$, that is, all Kronecker
indices are identical in this first step. Then the last index $p_K$ is varied between
0 and $p^{(1)}$ while all other indices are fixed at $p^{(1)}$. We denote the optimal value
of $p_K$ by $\hat{p}_K$, that is, $\hat{p}_K$ minimizes the prespecified criterion. Then we proceed
in the same way with $p_{K-1}$ and so on. More generally, $\hat{p}_k$ is chosen such that

$$\mathrm{Cr}(p^{(1)}, \dots, p^{(1)}, \hat{p}_k, \dots, \hat{p}_K) = \min\{\mathrm{Cr}(p^{(1)}, \dots, p^{(1)}, p, \hat{p}_{k+1}, \dots, \hat{p}_K) \mid p = 0, \dots, p^{(1)}\}.$$


This modification reduces the computational burden considerably. Just to
give an example, for K = 5 and $p_{\max} = 8$, at most 9 + 5 · 9 = 54 models have
to be estimated. If $p^{(1)}$ is small, then the number may be substantially lower.
For comparison purposes, we repeat that the number of estimations in a full
search procedure would be $9^5 = 59{,}049$.

To illustrate the procedure, consider the following panel of criterion values
for Kronecker indices $(p_1, p_2)$:

              p_1
  p_2      0      1      2      3
   0     3.48   3.28   3.26   3.27
   1     3.25   3.23   3.14   3.20
   2     3.23   3.21   3.15   3.19
   3     3.24   3.20   3.21   3.18

The minimum value on the main diagonal is obtained for $p^{(1)} = 2$ with
$\mathrm{Cr}(p^{(1)}, p^{(1)}) = 3.15$. Going upward from $(p_1, p_2) = (2, 2)$, the minimum is
seen to be Cr(2, 1) = 3.14. Turning left from $(p_1, p_2) = (2, 1)$, a further
reduction of the criterion value is not obtained. Therefore, the estimate for the
Kronecker indices is $\hat{\mathbf{p}} = (2, 1)$.
In this case, we have actually found the overall minimum of all criterion
values in the panel, that is,

$$\mathrm{Cr}(2, 1) = \min\{\mathrm{Cr}(p_1, p_2) \mid p_i = 0, 1, 2, 3\}.$$
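
The search path through the panel can be reproduced with a short sketch of Stage II (HK). The function name `hk_search` and the callable interface `cr` are assumptions for illustration; ties are broken in favour of the smaller index, which the text does not specify.

```python
def hk_search(cr, K, p_max):
    """Hannan-Kavalieris style shortcut search for Kronecker indices.

    cr maps a tuple (p_1, ..., p_K) to a criterion value.
    """
    # Step 1: all indices equal; pick the common value p1 on the diagonal.
    p1 = min(range(p_max + 1), key=lambda p: cr((p,) * K))
    idx = [p1] * K
    # Step 2: vary p_K, then p_{K-1}, ..., p_1 over 0..p1, keeping the rest.
    for k in reversed(range(K)):
        def value(p, k=k):
            trial = idx.copy()
            trial[k] = p
            return cr(tuple(trial))
        idx[k] = min(range(p1 + 1), key=value)
    return tuple(idx)


# Panel from the text: rows are p_2, columns are p_1.
PANEL = [
    [3.48, 3.28, 3.26, 3.27],
    [3.25, 3.23, 3.14, 3.20],
    [3.23, 3.21, 3.15, 3.19],
    [3.24, 3.20, 3.21, 3.18],
]


def panel_cr(p):
    p1, p2 = p
    return PANEL[p2][p1]


print(hk_search(panel_cr, K=2, p_max=3))  # (2, 1)
```

The sketch visits only the diagonal and then one column and one row of the panel, which is the source of the computational savings.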

In general, the Hannan-Kavalieris procedure will not lead to the overall
minimum. For instance, for our bivariate income/consumption example from
Section 13.3.2, we can find the HK estimates $\hat{\mathbf{p}}$ from Tables 13.1 and 13.2. For
example, on the main diagonal of the panel in Table 13.2, the HQ criterion
assumes its minimum for $(p_1, p_2) = (1, 1)$ and $\hat{\mathbf{p}}(\mathrm{HQ}) = (1, 0)$. Clearly, this
result differs from the estimate $\hat{\mathbf{p}} = (0, 2)$ that was obtained in the full search
procedure.

Under  suitable  conditions  for  the  model  selection  criteria,  the  HK  proce¬ 
dure  is  consistent.  Hannan  &  Deistler  (1988)  also  discussed  the  consequences 
of  the  true  data  generation  process  being  not  in  the  class  of  considered  pro¬ 
cesses. 


13.3.4  Poskitt’s  Procedure 

Another short-cut version of Stage II of the model specification procedure
was suggested by Poskitt (1992). It capitalizes on the important property of
echelon forms that the restrictions for the k-th equation implied by a set of
Kronecker indices $\mathbf{p}$ are determined by the indices $p_i \le p_k$. They do not
depend on the specific values of the $p_j$ which are greater than $p_k$. The proposed
modification of Stage II is based on separate LS estimations of each of the
K equations of the system. The estimation is similar to the preliminary
estimation method outlined in Section 12.3.4, that is, it uses the residuals of a
long autoregression from Stage I instead of the true $u_t$'s. A model selection
criterion of the form

$$\mathrm{Cr}_k(\mathbf{p}) := \ln \hat{\sigma}_k^2(\mathbf{p}) + c_T\, d_k(\mathbf{p})/T \tag{13.3.7}$$

is then evaluated for each of the K equations separately. Here $\hat{\sigma}_k^2(\mathbf{p})$ is the
residual variance estimate such that $T \hat{\sigma}_k^2(\mathbf{p})$ is the residual sum of squares of
the k-th equation in a system with Kronecker indices $\mathbf{p}$, $d_k(\mathbf{p})$ is the number
of freely varying parameters in the k-th equation, and $c_T$ is a number that
depends on the sample size T. Of course, (13.3.7) is the single equation analogue
of the systems criterion (13.3.1). Stage II of Poskitt's procedure proceeds then
as follows:

STAGE II (P): Determine the required values $\mathrm{Cr}_k(\mathbf{p})$ and choose the
estimates $\hat{p}_k$ of the Kronecker indices according to the following rule:

If $\mathrm{Cr}_k(0, \dots, 0) \ge \mathrm{Cr}_k(1, \dots, 1)$ for all k = 1, ..., K, compute $\mathrm{Cr}_k(2, \dots, 2)$,
k = 1, ..., K, and compare to $\mathrm{Cr}_k(1, \dots, 1)$. If the $\mathrm{Cr}_k(2, \dots, 2)$ are all not
greater than the corresponding $\mathrm{Cr}_k(1, \dots, 1)$, proceed to $\mathrm{Cr}_k(3, \dots, 3)$ and so
on. If at some stage

$$\mathrm{Cr}_k(j - 1, \dots, j - 1) \ge \mathrm{Cr}_k(j, \dots, j)$$

does not hold for all k, choose $\hat{p}_k = j - 1$ for all k with

$$\mathrm{Cr}_k(j - 1, \dots, j - 1) < \mathrm{Cr}_k(j, \dots, j).$$

The $\hat{p}_k$ obtained in this way are fixed in all the following steps. We continue
by increasing the remaining indices and comparing the criteria for those
equations for which the Kronecker indices are not yet fixed. Here it is important
that the restrictions for the k-th equation do not depend on the Kronecker
indices $p_i > p_k$ which are chosen in subsequent steps. ■

To make the procedure a bit more transparent, it may be helpful to
consider an example. Suppose that interest centers on a three-dimensional system,
that is, K = 3. First $\mathrm{Cr}_k(0, 0, 0)$ and $\mathrm{Cr}_k(1, 1, 1)$ are computed for k = 1, 2, 3.
Suppose

$$\mathrm{Cr}_k(0, 0, 0) \ge \mathrm{Cr}_k(1, 1, 1), \quad \text{for } k = 1, 2, 3.$$

Then we evaluate $\mathrm{Cr}_k(2, 2, 2)$, k = 1, 2, 3. Suppose

$$\mathrm{Cr}_1(1, 1, 1) < \mathrm{Cr}_1(2, 2, 2)$$

and

$$\mathrm{Cr}_k(1, 1, 1) \ge \mathrm{Cr}_k(2, 2, 2), \quad \text{for } k = 2, 3.$$

Then $\hat{p}_1 = 1$ is fixed and $\mathrm{Cr}_k(1, 2, 2)$ is compared to $\mathrm{Cr}_k(1, 3, 3)$ for k = 2, 3.
Suppose

$$\mathrm{Cr}_2(1, 2, 2) \ge \mathrm{Cr}_2(1, 3, 3) \quad \text{and} \quad \mathrm{Cr}_3(1, 2, 2) < \mathrm{Cr}_3(1, 3, 3).$$

Then we fix $\hat{p}_3 = 2$ and compare $\mathrm{Cr}_2(1, 3, 2)$ to $\mathrm{Cr}_2(1, 4, 2)$ and so on until
$\hat{p}_2$ can also be fixed because no further reduction of the criterion $\mathrm{Cr}_2(\cdot)$ is
obtained in one step. It is important to note that for each index only the
first local minimum of the corresponding criterion is searched for. We are
not seeking a global minimum over all $\mathbf{p}$ with $p_k$ less than some prespecified
upper bound. For moderate or large systems, the present procedure has the
advantage of involving a very reasonable amount of computation work only.
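
The selection rule of Stage II (P) can be sketched as follows. The names `poskitt_search` and `cr_k` are hypothetical, and the stopping bound `p_max` is an added assumption that guarantees termination; the rule in the text has no such bound. The toy criterion is invented purely to trigger the same sequence of decisions as in the three-dimensional example.

```python
def poskitt_search(cr_k, K, p_max):
    """Equation-by-equation search for the first local minimum.

    cr_k(k, p) returns the criterion of equation k (0-based) for the
    index vector p. Indices still unfixed at p_max are set to p_max.
    """
    fixed = {}
    j = 0
    while len(fixed) < K and j < p_max:
        # Fixed indices keep their value; the others move from j to j + 1.
        current = tuple(fixed.get(k, j) for k in range(K))
        candidate = tuple(fixed.get(k, j + 1) for k in range(K))
        for k in range(K):
            if k not in fixed and cr_k(k, current) < cr_k(k, candidate):
                fixed[k] = j  # the criterion would rise, so stop here
        j += 1
    return tuple(fixed.get(k, p_max) for k in range(K))


# Toy criterion (invented): equation k is minimized at TARGET[k].
TARGET = (1, 3, 2)


def toy_cr(k, p):
    return (p[k] - TARGET[k]) ** 2


print(poskitt_search(toy_cr, K=3, p_max=6))  # (1, 3, 2)
```

With this toy criterion the first index is fixed at 1 when moving from (1, 1, 1) to (2, 2, 2), the third at 2 when moving from (1, 2, 2) to (1, 3, 3), and the second at 3 in the following step.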

Poskitt  (1992)  derived  the  properties  of  the  Kronecker  indices  and  the 
VARMA  coefficients  estimated  by  this  procedure.  He  gave  conditions  under 
which  the  Kronecker  indices  are  estimated  consistently  and  the  final  VARMA 
parameter  estimators  have  the  asymptotic  distribution  given  in  Proposition 
12.1.  Assuming  that  the  true  data  generation  process  can  indeed  be  described 
by  a  stable,  invertible  ARMA e  representation  with  a  finite  set  of  Kronecker 
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indices,  the  conditions  imposed  by  Poskitt  relate  to  the  distribution  of  the 
white  noise  process  ut,  to  the  choice  of  n,  and  to  the  criteria  Crfc(p). 

With respect to the process or white noise distribution, the assumptions are satisfied, for example, if $u_t$ is Gaussian. In fact, for most results it suffices that $u_t$ is standard white noise. An exception is the asymptotic distribution of the white noise covariance estimator $\tilde\Sigma_u$. It may change if $u_t$ has a nonnormal distribution.

The  VAR  order  n  at  Stage  I  is  assumed  to  go  to  infinity  with  the  sample 
size  at  a  certain  rate.  In  practice,  the  order  selection  criteria  AIC  or  HQ  may 
be  used  at  Stage  I.  It  must  be  guaranteed,  however,  that  n  is  greater  than  the 
Kronecker  indices  considered  at  Stage  II(P). 

Poskitt  (1992)  also  discussed  a  modification  of  his  algorithm  that  appears 
to  have  some  practical  advantages.  We  do  not  go  into  that  procedure  here  but 
recommend  that  the  interested  reader  examines  the  relevant  literature.  The 
message  from  the  present  discussion  should  be  that  consistent  and  feasible 
strategies  for  estimating  the  Kronecker  indices  exist.  Poskitt  also  discussed 
the  case  where  the  true  data  generation  process  is  not  in  the  class  of  VARMA 
processes  considered  in  the  specification  procedure.  He  derived  some  asymp¬ 
totic  results  for  this  case  as  well. 

In  summary,  a  full  search  procedure  is  feasible  for  low-dimensional  systems 
if  the  maximum  for  the  Kronecker  indices  is  small  or  moderate.  For  high¬ 
dimensional  systems  and/or  large  upper  bounds  of  the  Kronecker  indices,  the 
Hannan-Kavalieris  procedure  or  Poskitt’s  specification  strategy  are  preferable 
from  a  computational  point  of  view.  The  relative  performance  in  small  samples 
is  so  far  unknown  in  general.  It  is  left  to  the  individual  researcher  to  decide 
on  a  specific  specification  procedure  with  his  or  her  available  resources  and 
perhaps  the  objective  of  the  analysis  in  mind.  Of  course,  it  is  legitimate  to 
try  different  strategies  and  criteria  and  compare  the  resulting  models  and  the 
implications  for  the  subsequent  analysis. 


13.4  Remarks  on  Other  Specification  Strategies  for 
VARMA  Models 

A  number  of  other  specification  strategies  for  VARMA  processes  were  pro¬ 
posed  and  investigated  in  the  literature  based  on  representations  other  than 
the  final  equations  and  echelon  forms.  Examples  are  Quenouille  (1957),  Tiao 
&  Box  (1981),  Jenkins  &  Alavi  (1981),  Aoki  (1987),  Cooper  &  Wood  (1982), 
Granger  &  Newbold  (1986),  Akaike  (1976),  Tiao  &  Tsay  (1989),  Tsay  (1989a, 
b),  to  list  just  a  few.  Some  of  these  strategies  are  based  on  subjective  criteria. 
As  mentioned  earlier,  none  of  these  procedures  seems  to  be  in  common  use 
for  analyzing  economic  time  series  and  none  of  them  has  become  the  standard 
procedure.  So  far,  few  VARMA  analyses  of  higher-dimensional  time  series  are 
reported  in  the  literature.  Given  this  state  of  affairs,  it  is  difficult  to  give 
well-founded  recommendations  as  to  which  strategy  to  use  in  any  particular
situation.  Those  familiar  with  the  Box-Jenkins  approach  for  univariate  time
series  modelling  will  be  aware  of  the  problems  that  can  arise  even  in  the 
univariate  case  if  the  investigator  has  to  decide  on  a  model  on  the  basis  of 
statistics  such  as  the  autocorrelations  and  partial  autocorrelations.  Therefore, 
it  is  an  obvious  advantage  to  have  an  automatic  or  semiautomatic  procedure 
if  one  feels  uncertain  about  the  interpretation  of  statistical  quantities  related 
to  specific  characteristics  of  a  process  and  if  little  or  no  prior  information  is 
available.  On  the  other  hand,  if  firmly  based  prior  information  about  the  data 
generation  process  is  available,  then  it  may  be  advantageous  to  use  that  at  an 
early  stage  and  depart  from  automatic  procedures. 


13.5  Model  Checking 

Prominent  candidates  in  the  model  checking  tool-kit  are  tests  of  statistical 
hypotheses.  All  three  testing  principles,  LR  (likelihood  ratio),  LM  (Lagrange 
multiplier),  and  Wald  tests  (see  Appendix  C.7)  can  be  applied  in  principle  in 
the  VARMA  context.  Because  estimation  requires  iterative  procedures,  it  is 
often  desirable  to  estimate  just  one  model.  Hence,  LR  tests  which  require  esti¬ 
mation  under  both  the  null  and  alternative  hypotheses  are  often  unattractive. 
In  finite  order  VAR  modelling,  the  unrestricted  version  is  usually  relatively 
easy  to  estimate  and  therefore  it  makes  sense  to  use  Wald  tests  in  the  pure 
VAR  case  because  these  tests  are  based  on  the  unconstrained  estimator.  In 
contrast,  the  restricted  estimator  is  often  easier  to  obtain  in  the  VARMA  con¬ 
text  when  models  with  nontrivial  MA  part  are  considered.  In  this  situation, 
LM  tests  have  an  obvious  advantage  because  the  LM  statistic  involves  the 
restricted  estimator  only.  Of  course,  the  restricted  estimator  is  especially  easy 
to  determine  if  the  constrained  model  is  a  pure,  finite  order  VAR  process. 
We  will  briefly  discuss  LM  tests  in  the  following.  For  further  discussions  and 
proofs  the  reader  is  referred  to  Kohn  (1979),  Hosking  (1981b),  and  Poskitt  & 
Tremayne  (1982). 


13.5.1  LM  Tests 

Suppose  we  wish  to  test 

$$H_0: \varphi(\beta) = 0 \quad \text{against} \quad H_1: \varphi(\beta) \neq 0, \tag{13.5.1}$$

where $\beta$ is an M-dimensional parameter vector and $\varphi(\cdot)$ is a twice continuously differentiable function with values in the N-dimensional Euclidean space. In other words, $\varphi(\beta)$ is an (N × 1) vector and we assume that the matrix $\partial\varphi/\partial\beta'$ of first order partial derivatives has rank N at the true parameter vector. In
this  setup,  we  consider  the  case  where  the  restrictions  relate  to  the  VARMA 
coefficients  only.  Moreover,  we  assume  that  the  conditions  of  Proposition  12.1 
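are satisfied.

For instance, in the bivariate zero mean VARMA(1,1) model with Kronecker indices $(p_1, p_2) = (1,1)$,

$$\begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix} = \begin{bmatrix} \alpha_{11,1} & \alpha_{12,1} \\ \alpha_{21,1} & \alpha_{22,1} \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} u_{1,t} \\ u_{2,t} \end{bmatrix} + \begin{bmatrix} m_{11,1} & m_{12,1} \\ m_{21,1} & m_{22,1} \end{bmatrix} \begin{bmatrix} u_{1,t-1} \\ u_{2,t-1} \end{bmatrix} \tag{13.5.2}$$

with $\beta' = (\alpha_{11,1}, \alpha_{21,1}, \alpha_{12,1}, \alpha_{22,1}, m_{11,1}, m_{21,1}, m_{12,1}, m_{22,1})$, one may wish to test that the MA degree is zero, that is,

$$\varphi(\beta) = \begin{bmatrix} m_{11,1} \\ m_{21,1} \\ m_{12,1} \\ m_{22,1} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

The corresponding matrix of partial derivatives is

$$\frac{\partial\varphi}{\partial\beta'} = [\,0_{4\times 4} : I_4\,],$$

which obviously has rank N = 4.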

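As another example, suppose we wish to test for Granger-causality from $y_{2t}$ to $y_{1t}$ in the model (13.5.2). In that case,

$$\varphi(\beta) = \begin{bmatrix} \alpha_{12,1} + m_{12,1} \\ \alpha_{22,1} m_{12,1} - \alpha_{12,1} m_{22,1} \end{bmatrix} \tag{13.5.3}$$

(see Remark 1 of Section 11.7.1). The corresponding matrix of partial derivatives is

$$\frac{\partial\varphi}{\partial\beta'} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & -m_{22,1} & m_{12,1} & 0 & 0 & \alpha_{22,1} & -\alpha_{12,1} \end{bmatrix}.$$

This matrix may have rank 1 under special conditions. In particular, this occurs if $\alpha_{12,1} = m_{12,1} = 0$ and $\alpha_{22,1} = -m_{22,1}$.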
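The rank condition can be checked numerically. The following sketch evaluates the matrix of partial derivatives of (13.5.3) at two parameter vectors; the function name `granger_jacobian` and the numerical values are arbitrary choices made for illustration and are not taken from the text.

```python
import numpy as np

def granger_jacobian(beta):
    """Matrix of partial derivatives of phi(beta) in (13.5.3).

    beta is ordered as (a11, a21, a12, a22, m11, m21, m12, m22),
    where aij stands for alpha_{ij,1} and mij for m_{ij,1}.
    """
    a11, a21, a12, a22, m11, m21, m12, m22 = beta
    return np.array([
        [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, -m22, m12, 0.0, 0.0, a22, -a12],
    ])

# A generic parameter vector (illustrative values).
generic = [0.5, 0.1, 0.3, 0.4, 0.2, 0.0, -0.3, 0.6]
# The special case alpha_{12,1} = m_{12,1} = 0 and alpha_{22,1} = -m_{22,1}.
special = [0.5, 0.1, 0.0, 0.4, 0.2, 0.0, 0.0, -0.4]

rank_generic = np.linalg.matrix_rank(granger_jacobian(generic))
rank_special = np.linalg.matrix_rank(granger_jacobian(special))
```

In the special case the second row is a multiple of the first one, so that the rank drops from 2 to 1 and the rank assumption underlying the LM test is violated.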

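The LM statistic for testing (13.5.1) is

$$\lambda_{LM} := s(\tilde\beta_r)'\, \tilde{I}_a(\tilde\beta_r, \tilde\Sigma_u^r)^{-1} s(\tilde\beta_r)/T, \tag{13.5.4}$$

where

$$s(\tilde\beta_r) = \frac{\partial \ln l_0}{\partial\beta}\bigg|_{\tilde\beta_r} = -\sum_{t=1}^{T} \frac{\partial u_t(y, \tilde\beta_r)'}{\partial\beta} (\tilde\Sigma_u^r)^{-1} u_t(y, \tilde\beta_r) \tag{13.5.5}$$

is the score vector evaluated at the restricted estimator $\tilde\beta_r$ and

$$\tilde{I}_a(\tilde\beta_r, \tilde\Sigma_u^r) = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial u_t(y, \tilde\beta_r)'}{\partial\beta} (\tilde\Sigma_u^r)^{-1} \frac{\partial u_t(y, \tilde\beta_r)}{\partial\beta'} \tag{13.5.6}$$

is an estimator of the asymptotic information matrix based on the restricted estimator $\tilde\beta_r$. Here

$$\tilde\Sigma_u^r = \frac{1}{T} \sum_{t=1}^{T} u_t(y, \tilde\beta_r) u_t(y, \tilde\beta_r)'.$$

Note that in contrast to Appendix C, Section C.7, an estimator of the asymptotic information matrix rather than the information matrix is used in (13.5.4). Therefore T appears in the denominator. If $H_0$ is true, the statistic $\lambda_{LM}$ has an asymptotic $\chi^2(N)$-distribution under general conditions.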
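Once the restricted residuals and the derivatives of the residuals with respect to the full parameter vector are available, the quantities in (13.5.4)-(13.5.6) reduce to a few matrix products. The sketch below assumes that these inputs are supplied as arrays; the function name `lm_statistic` and the array layout are our own choices for illustration. How the derivatives are obtained for a given VARMA model is not covered here.

```python
import numpy as np

def lm_statistic(U, D):
    """LM statistic in the form of (13.5.4) (illustrative sketch).

    U : (T, K) array of residuals u_t evaluated at the restricted estimator.
    D : (T, K, M) array with D[t] = d u_t / d beta' at the restricted estimator.
    """
    T = U.shape[0]
    sigma_inv = np.linalg.inv(U.T @ U / T)             # inverse of restricted covariance
    # Score: minus the sum over t of (du_t'/dbeta) Sigma^{-1} u_t.
    score = -np.einsum('tkm,kl,tl->m', D, sigma_inv, U)
    # Estimated asymptotic information matrix, as in (13.5.6).
    info = np.einsum('tkm,kl,tln->mn', D, sigma_inv, D) / T
    return float(score @ np.linalg.solve(info, score) / T)


# Toy check (an assumed example): test beta = 0 in y_t = x_t * beta + u_t.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
# Under the restriction u_t = y_t, and du_t/dbeta = -x_t.
lam = lm_statistic(y[:, None], -x[:, None, None])
```

In this univariate toy case the statistic equals T times the uncentered $R^2$ of a regression of $y_t$ on $x_t$, which is the familiar form of an LM statistic in a linear regression.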

The  LM  test  is  especially  suitable  for  model  checking  because  testing  larger 
VAR  or  MA  orders  against  a  maintained  model  is  particularly  easy.  A  new 
estimation  is  not  required  as  long  as  the  null  hypothesis  does  not  change. 
For instance, if we wish to test a given VARMA(p, q) specification against a VARMA(p+s, q) or a VARMA(p, q+s) model, we just need an estimator of the coefficients of the VARMA(p, q) process. Note, however, that a VARMA(p, q) cannot be tested against a VARMA(p+s, q+s), that is, we cannot increase both the VAR and MA orders simultaneously because the VARMA(p+s, q+s) model will not be identified (cancellation is possible!) if the null hypothesis is true. In that case, the LM statistic will not have its usual asymptotic $\chi^2$-distribution.

13.5.2  Residual  Autocorrelations  and  Portmanteau  Tests 

Alternative  tools  for  model  checking  are  the  residual  autocorrelations  and 
portmanteau  tests.  The  asymptotic  distributions  of  the  residual  autocorrela¬ 
tions  of  estimated  VARMA  models  were  discussed  by  Hosking  (1980),  Li  & 
McLeod  (1981),  and  Poskitt  &  Tremayne  (1982),  among  others.  We  do  not 
give the details here but just mention that the resulting standard errors of autocorrelations at large lags obtained from asymptotic considerations are approximately $1/\sqrt{T}$, while they may be much smaller for low lags, just as for pure finite order VAR processes.

The modified portmanteau statistic is

$$Q_h := T^2 \sum_{i=1}^{h} (T - i)^{-1} \operatorname{tr}(C_i' C_0^{-1} C_i C_0^{-1}), \tag{13.5.7}$$

where

$$C_i := \frac{1}{T} \sum_{t=i+1}^{T} u_t(y, \tilde\beta) u_{t-i}(y, \tilde\beta)'$$

and the $u_t(y, \tilde\beta)$'s are the residuals of an estimated VARMA model, as before. Under general conditions, $Q_h$ has an approximate asymptotic $\chi^2$-distribution. The degrees of freedom are obtained by subtracting the number of freely estimated VARMA coefficients from $K^2 h$.
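A minimal sketch of how (13.5.7) may be computed from a matrix of residuals is given next. The function name `modified_portmanteau` and the argument `n_coef` (the number of freely estimated VARMA coefficients, which the user has to supply) are hypothetical. The residuals are not mean-adjusted because a zero mean model is assumed.

```python
import numpy as np

def modified_portmanteau(U, h, n_coef):
    """Modified portmanteau statistic (13.5.7) and its degrees of freedom.

    U      : (T, K) array of residuals of an estimated VARMA model.
    h      : number of residual autocovariance lags included.
    n_coef : number of freely estimated VARMA coefficients (user supplied).
    """
    U = np.asarray(U, dtype=float)
    T, K = U.shape
    c0_inv = np.linalg.inv(U.T @ U / T)
    total = 0.0
    for i in range(1, h + 1):
        # C_i = (1/T) * sum_{t=i+1}^{T} u_t u_{t-i}'
        c_i = U[i:].T @ U[:-i] / T
        total += np.trace(c_i.T @ c0_inv @ c_i @ c0_inv) / (T - i)
    return T ** 2 * total, K * K * h - n_coef


# Tiny deterministic example with K = 1 and T = 4.
q_stat, dof = modified_portmanteau([[1.0], [-1.0], [1.0], [-1.0]], h=1, n_coef=0)
```

For the alternating series of the example, $C_0 = 1$ and $C_1 = -3/4$, so that $Q_1 = 16 \cdot (1/3) \cdot (9/16) = 3$.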

13.5.3 Prediction Tests for Structural Change

In the pure VAR case, we have considered prediction tests for structural change as model checking devices. If the data generation process is Gaussian, the two tests introduced in Chapter 4, Section 4.6.2, may be applied in the VARMA case as well with minor modifications.

The statistics based on h-step ahead forecasts only are of the form

$$\tau_h := e_T(h)' \hat\Sigma_{\hat y}(h)^{-1} e_T(h)/K, \tag{13.5.8}$$

where $e_T(h) = y_{T+h} - \hat y_T(h)$ is the error vector of an h-step forecast based on an estimated VARMA(p, q) process and $\hat\Sigma_{\hat y}(h)$ is an estimator of the corresponding MSE matrix (see Section 12.5). The statistic may be applied in conjunction with an F(K, T − K(p + q) − 1)-distribution. The denominator degrees of freedom may be used even if constraints are imposed on the VARMA coefficients because the F-distribution is just chosen as a small sample approximation to a $\chi^2(K)/K$ distribution. Its justification comes from the fact that F(K, T − s) converges to $\chi^2(K)/K$ for any fixed constant s, as T approaches infinity. Thus, any constant that is subtracted from T in the denominator degrees of freedom of the F-distribution is justified on the same asymptotic grounds. It is not clear which choice is best from a small sample point of view.

The other statistic considered in Section 4.6.2 is based on 1- to h-step forecasts and, for the present case, it may be modified as

$$\lambda_h := T \sum_{i=1}^{h} u_{T+i}' \tilde\Sigma_u^{-1} u_{T+i} / [(T + K(p + q) + 1)Kh] \tag{13.5.9}$$

and its approximate distribution for a structurally stable Gaussian process is F(Kh, T − K(p + q) − 1). Here $u_{T+i} = y_{T+i} - \hat y_{T+i-1}(1)$ and $\tilde\Sigma_u$ is the ML estimator of $\Sigma_u$. Note that the LS estimator of $\Sigma_u$ was used in Section 4.6.2 instead. Again, there is not much theoretical justification for the choice of the denominator in (13.5.9) and for the denominator degrees of freedom in the approximating F-distribution. More detailed investigations of the small sample distribution of $\lambda_h$ are required before firmly based recommendations regarding modifications of the statistic are possible. Here we have just used the direct analogue of the finite order pure VAR case.
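Given forecast errors and covariance estimates, both prediction test statistics are simple quadratic forms. The following sketch evaluates (13.5.8) and (13.5.9) for supplied inputs; the function names `tau_stat` and `lambda_stat` are hypothetical, and obtaining the forecasts and MSE estimates from a fitted VARMA model is assumed to have been done elsewhere.

```python
import numpy as np

def tau_stat(e, mse):
    """Statistic (13.5.8): e' MSE^{-1} e / K for one h-step forecast error e."""
    e = np.asarray(e, dtype=float)
    return float(e @ np.linalg.solve(mse, e)) / e.size

def lambda_stat(u_fore, sigma_u, T, p, q):
    """Statistic (13.5.9) based on the 1-step forecast errors u_{T+1}, ..., u_{T+h}.

    u_fore  : (h, K) array, one forecast error vector per row.
    sigma_u : (K, K) estimate of the white noise covariance matrix.
    """
    U = np.atleast_2d(np.asarray(u_fore, dtype=float))
    h, K = U.shape
    quad = sum(float(u @ np.linalg.solve(sigma_u, u)) for u in U)
    return T * quad / ((T + K * (p + q) + 1) * K * h)


# Illustrative numbers only.
tau = tau_stat([1.0, 2.0], np.eye(2))
lam = lambda_stat([[1.0, 2.0], [0.0, 1.0]], np.eye(2), T=94, p=1, q=1)
```

The values would then be compared with critical values of the F(K, T − K(p + q) − 1)- and F(Kh, T − K(p + q) − 1)-distributions, respectively.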

It is also possible to fit a finite order VAR process to data generated by a mixed VARMA process and base the prediction tests on forecasts from that model. In Chapter 15, it will be shown that such an approach is theoretically sound under general conditions.


13.6  Critique  of  VARMA  Model  Fitting 

In this and the previous two chapters, much of the analysis is based on the assumption that the true data generation mechanism is from the VARMA(p, q) class. In practice, any such model is just an approximation to the actual data
generation  process.  Therefore,  the  model  selection  task  is  not  really  the  prob¬ 
lem  of  finding  the  true  structure  but  of  finding  a  good  or  useful  approximation 
to  the  real  life  mechanism.  Despite  this  fact,  it  is  sometimes  helpful  to  assume 
a  specific  true  process  or  process  class  to  be  able  to  derive,  under  ideal  condi¬ 
tions,  the  statistical  properties  of  the  procedures  used.  One  then  hopes  that 
the  actual  properties  of  a  procedure  in  a  particular  practical  situation  are  at 
least  similar  to  those  obtained  under  ideal  conditions. 

Against  this  background,  one  may  wonder  whether  it  is  sufficient  or  even 
preferable  to  approximate  the  generation  process  of  a  given  multiple  time  se¬ 
ries  by  a  finite  order  VAR(p)  process  rather  than  go  through  the  painstaking 
specification  and  estimation  of  a  mixed  VARMA  model.  Clearly,  the  estima¬ 
tion  of  VARMA  models  is  in  general  more  complicated  than  that  of  finite  order 
VAR  models.  Moreover,  the  specification  of  VAR  models  by  statistical  meth¬ 
ods  is  much  simpler  than  that  of  VARMA  models.  Are  there  still  situations 
where  it  is  reasonable  to  consider  the  more  complicated  VARMA  models?  The 
answer  to  this  question  is  in  the  affirmative.  For  instance,  if  subject  matter 
theory  suggests  a  VARMA  model  with  nontrivial  MA  part,  it  is  often  neces¬ 
sary  to  work  with  such  a  specification  to  answer  the  questions  of  interest  or 
derive  the  relevant  results.  Also,  in  some  cases,  a  VARMA  approximation  may 
be  more  parsimonious  in  terms  of  the  number  of  parameters  involved  than  an 
appropriate  finite  order  VAR  approximation.  In  such  cases,  the  VARMA  ap¬ 
proximation  may,  for  instance,  result  in  more  efficient  forecasts  that  justify 
the  costly  specification  and  estimation  procedures.  The  future  attractiveness 
of  VARMA  models  will  depend  on  the  easy  availability  of  efficient  and  robust 
estimation  and  specification  procedures  that  reduce  the  costs  to  an  acceptable 
level. 

In  Chapter  15,  we  will  follow  another  road  and  explicitly  assume  that  just 
an  approximating  and  not  a  true  VAR(p)  model  is  fitted.  Assumptions  will  be 
provided  that  allow  the  derivation  of  statistical  properties  in  that  case.  So  far, 
we  have  considered  stable,  stationary  VARMA  processes.  In  the  next  chapter, 
extensions  for  integrated  and  cointegrated  variables  will  be  considered. 


13.7  Exercises 

Problem  13.1 

At  the  first  stage  of  a  final  equations  form  specification  procedure,  the  follow¬ 
ing  two  univariate  models  were  obtained: 

$$(1 + 0.3L - 0.4L^2)y_{1t} = (1 + 0.6L)v_{1t},$$

$$(1 - 0.5L)y_{2t} = (1 + 0.6L)v_{2t}.$$

Which orders do you choose for the bivariate final equations VARMA representation of $(y_{1t}, y_{2t})'$?

Problem 13.2

At  Stage  II  of  a  specification  procedure  for  an  echelon  form  of  a  bivariate 
system,  the  following  values  of  the  HQ  criterion  are  obtained: 


                     p_2
p_1      0      1      2      3      4
 0      2.1    1.9    1.5    1.5    1.6
 1      1.8    1.7    1.4    1.2    1.3
 2      1.7    1.4    1.3    1.4    1.4
 3      1.7    1.4    1.3    1.4    1.5
 4      1.8    1.7    1.6    1.5    1.5

Choose an estimate $(\hat p_1, \hat p_2)'$ by the Hannan-Kavalieris procedure. Interpret the estimate in the light of a full search procedure.

Problem  13.3 

At the second stage of the Poskitt procedure for a bivariate model, the specification criteria $Cr_1(p_1, p_2)$ and $Cr_2(p_1, p_2)$ assume the following values:


$Cr_1(p_1, p_2)$

$Cr_2(p_1, p_2)$

p_1

p_1

p_2
3

p_2

nr
2

3
2.5

1.7 

1.8 

0 

4.2 

3.2 

3.2 

3.2 

1 

3.5 

1.5 

1.8 

1.7 

1 

3.5 

1.8 

1.9
2

3.5 

1.5 

1.8 

1.9 

2 

3.1
1.7

1.6
3.5

1.5 

1.8 

1.4
3.4

2.1 

1.8 

1.9 

Use the Poskitt strategy to find an estimate $(\hat p_1, \hat p_2)$ of the Kronecker indices.

The  following  problems  require  the  use  of  a  computer.  They  are  based  on 
the  first  differences  of  the  U.S.  investment  data  given  in  File  E2. 

Problem 13.4

Determine a final equations form VARMA model for the U.S. investment data for the years 1947-1968 using the specification strategy described in Section 13.2.1.

Problem  13.5 

Determine an $ARMA_E$ model for the U.S. investment data using the specification strategy described in Section 13.3.2 with n = 6 and based on the HQ criterion. Compare the model to the final equations form model from Problem 13.4.

Problem  13.6 

Compute forecasts for the investment series for the years 1969 and 1970 based on (i) the final equations form VARMA model, (ii) the $ARMA_E$ model, and (iii) a bivariate VAR(1) model. Compare the forecasts to the true values and interpret.

Problem 13.7

Compute $\Phi_i$ and $\Theta_i$ impulse responses from the two models obtained in Problems 13.4 and 13.5, compare and interpret them.

Problem  13.8 

Specify a univariate ARMA model for the sum $z_t = y_{1t} + y_{2t}$ of the two investment series for the years 1947-1968. Is the univariate ARMA model compatible with the bivariate echelon form model specified in Problem 13.5? (Hint: Use the results of Section 11.6.1.)

Problem  13.9 

Evaluate forecasts for the $z_t$ series of the previous problem for the years 1969-1970 and compare them to forecasts obtained by aggregating the bivariate forecasts from the $ARMA_E$ model of Problem 13.5.


14 


Cointegrated  VARMA  Processes 


14.1  Introduction 

So far, we have concentrated on stationary VARMA processes for I(0) variables. In this chapter, the variables are allowed to be I(1) and may be cointegrated. As we have seen in Chapter 12, one of the problems in dealing with
VARMA  models  is  the  nonuniqueness  of  their  parameterization.  For  infer¬ 
ence  purposes,  it  is  necessary  to  focus  on  a  unique  representation  of  a  DGP. 
For  stationary  VARMA  processes,  we  have  considered  the  echelon  form  to 
tackle  the  identification  problem.  In  the  next  section,  this  representation  of 
a  VARMA  process  will  be  combined  with  the  error  correction  (EC)  form. 
Thereby  it  is  again  possible  to  separate  the  long-run  cointegration  relations 
from  the  short-term  dynamics.  The  resulting  representation  turns  out  to  be  a 
convenient  framework  for  modelling  cointegrated  variables. 

The  representation  of  a  VARMA  process  considered  in  this  chapter  is 
characterized  by  the  cointegrating  rank  and  the  Kronecker  indices.  When 
these  quantities  are  given,  the  model  can  be  estimated.  Estimation  procedures 
and  their  asymptotic  properties  are  considered  in  Section  14.3.  A  procedure 
for  specifying  the  Kronecker  indices  and  the  cointegrating  rank  from  a  given 
multiple  time  series  will  be  discussed  in  Section  14.4.  The  forecasting  aspects 
of  our  models  will  be  addressed  briefly  in  Section  14.5  and  an  example  is  given 
in  Section  14.6. 

In this chapter, an introductory treatment of cointegrated VARMA models is given. The chapter draws on material from Lütkepohl & Claessen (1997) who introduced the error correction echelon form of a VARMA process, Poskitt & Lütkepohl (1995) and Poskitt (2003) who presented estimation and specification procedures for such models, as well as Bartel & Lütkepohl (1998) who explored the small sample properties of some of the procedures. Further references to more advanced treatments of specific issues will be given throughout the chapter.


14.2 The VARMA Framework for I(1) Variables

14.2.1  Levels  VARMA  Models 

In this chapter, it is assumed that some or all of the variables of interest are I(1) variables, whereas the remaining ones are again I(0). Moreover, the variables may be cointegrated. Thus, we consider the situation that was discussed extensively in Part II. In contrast to the framework of that part, we now assume that the DGP of $y_t = (y_{1t}, \ldots, y_{Kt})'$ is from the VARMA class,

$$A_0 y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + M_0 u_t + M_1 u_{t-1} + \cdots + M_p u_{t-p}, \quad t = 1, 2, \ldots, \tag{14.2.1}$$


or 


$$A(L)y_t = M(L)u_t, \quad t = 1, 2, \ldots, \tag{14.2.2}$$

where $u_t = y_t = 0$ for $t \le 0$ is assumed for convenience and, as usual, $u_t$ is a white noise process with zero mean and nonsingular, time invariant covariance matrix $E(u_t u_t') = \Sigma_u$. Moreover, in (14.2.2) the VAR operator is

$$A(L) := A_0 - A_1 L - \cdots - A_p L^p$$

and the MA operator is

$$M(L) := M_0 + M_1 L + \cdots + M_p L^p.$$

The zero order matrices $A_0$ and $M_0$ are assumed to be nonsingular and some of the coefficient matrices may be zero so that the AR or MA order may actually be less than p. The matrix polynomials are assumed to satisfy

$$\det A(z) \neq 0, \ |z| \le 1, \ z \neq 1, \quad \text{and} \quad \det M(z) \neq 0, \ |z| \le 1. \tag{14.2.3}$$

The second part of this condition is the usual invertibility condition for the MA operator. As in the pure VAR case, we allow the VAR operator A(z) to have roots for z = 1 to account for integrated and cointegrated components of $y_t$. As mentioned previously, all component series are at most I(1), that is, $\Delta y_t$ is stationary or at least asymptotically stationary.

Notice  that  there  are  no  deterministic  terms  in  our  model.  For  the  intro¬ 
ductory  treatment  of  the  present  chapter,  this  setup  is  convenient.  Of  course, 
in  applied  work,  deterministic  terms  will  usually  be  required.  Although  adding 
such  terms  is  formally  straightforward,  it  is  known  from  the  discussion  in 
Chapter  6,  Section  6.4,  and  Chapter  7,  Section  7.2.4,  that  the  implications  of 
such  terms  in  models  with  integrated  variables  are  more  complicated  than  in 
the  stationary  case,  in  particular  with  respect  to  statistical  inference. 

In this context, it may also be worth emphasizing that the zero initial value assumption ($u_t = y_t = 0$ for $t \le 0$) is not altogether innocent. Allowing for more general initial values will result in additional complications which we intend to avoid here. Further comments on these issues will be provided later.

Under our assumptions of zero initial values, the process has the pure VAR representation

$$y_t = \sum_{i=1}^{t-1} \Pi_i y_{t-i} + u_t, \tag{14.2.4}$$

where

$$\Pi(z) = I_K - \sum_{i=1}^{\infty} \Pi_i z^i = M(z)^{-1} A(z),$$

as in Section 11.3. Notice that the inverse of M(z) exists under our invertibility assumption (14.2.3). The process also has a pure MA representation

$$y_t = \sum_{i=0}^{t-1} \Phi_i u_{t-i},$$

where

$$\Phi(z) = \sum_{i=0}^{\infty} \Phi_i z^i = A(z)^{-1} M(z).$$

Here the inverse of A(z) is defined only in a small neighborhood of zero and, in particular,

$$\sum_{i=0}^{n} \Phi_i$$

may diverge for $n \to \infty$. Our MA representation is still valid due to the zero initial value assumption. The VAR and MA representations of the process show that the uniqueness of the VARMA representation can be discussed in the same way as in Chapter 12. We have to find restrictions for A(L) and M(L) such that a unique relation between [A(L) : M(L)] and $M(L)^{-1}A(L)$ is obtained. From Chapter 12, we know already that the echelon form restrictions can be used for that purpose. In the present situation, a slight modification turns out to be useful. We will present it in the next subsection.
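The coefficients of the pure MA representation can be computed recursively by comparing coefficients in $A(z)\Phi(z) = M(z)$, which gives $A_0\Phi_i = M_i + \sum_{j=1}^{\min(i,p)} A_j \Phi_{i-j}$ with $M_i = 0$ for $i > p$. The sketch below implements this recursion; the function name `ma_coefficients` and the list-based interface are our own choices for illustration. A univariate random walk is used to show that the partial sums of the $\Phi_i$ need not converge.

```python
import numpy as np

def ma_coefficients(A, M, n):
    """Return [Phi_0, ..., Phi_n] with Phi(z) = A(z)^{-1} M(z).

    A = [A_0, A_1, ..., A_p] for A(z) = A_0 - A_1 z - ... - A_p z^p,
    M = [M_0, M_1, ..., M_q] for M(z) = M_0 + M_1 z + ... + M_q z^q.
    """
    A = [np.atleast_2d(np.asarray(a, dtype=float)) for a in A]
    M = [np.atleast_2d(np.asarray(m, dtype=float)) for m in M]
    K = A[0].shape[0]
    phi = []
    for i in range(n + 1):
        rhs = M[i].copy() if i < len(M) else np.zeros((K, K))
        for j in range(1, min(i, len(A) - 1) + 1):
            rhs += A[j] @ phi[i - j]
        phi.append(np.linalg.solve(A[0], rhs))     # A_0 Phi_i = rhs
    return phi


# Univariate random walk y_t = y_{t-1} + u_t: A(z) = 1 - z, M(z) = 1.
phi_rw = ma_coefficients([1.0, 1.0], [1.0], n=5)
partial_sum = sum(float(ph[0, 0]) for ph in phi_rw)
```

For the random walk all $\Phi_i$ are equal to one, so the partial sum over $i = 0, \ldots, n$ equals $n + 1$ and grows without bound, although $y_t$ is well defined for each finite t.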

If zero initial values are not assumed, the initial values may also help in identifying the model. A discussion of how initial values can contribute to uniquely identifying a VARMA process with I(1) variables is provided by Poskitt (2004). The problem is, however, that the initial values of the $u_t$ will usually be unknown in practice and may not be available for identification.




14.2.2  The  Reverse  Echelon  Form 

In order to obtain a unique representation, we use similar restrictions as in Definition 12.2. We will, however, reverse the roles of A(L) and M(L) in this case, as proposed by Lütkepohl & Claessen (1997). In other words, we now impose the restrictions placed on the VAR operator in Definition 12.2 on M(L) and similarly, the restrictions for M(L) in that definition will now be imposed on the VAR operator. This modification will turn out to be convenient in combining the restrictions with the error correction form. We denote the kl-th elements of A(z) and M(z) by $\alpha_{kl}(z)$ and $m_{kl}(z)$, respectively, and impose the constraints specified in the following definition.

Definition 14.1 (Reverse Echelon Form)

The VARMA representation (14.2.1) is in reverse echelon form if A(L) and M(L) satisfy the following restrictions: The operator [A(z) : M(z)] is left-coprime,

$$m_{kk}(L) = 1 + \sum_{i=1}^{p_k} m_{kk,i} L^i, \quad \text{for } k = 1, \ldots, K, \tag{14.2.5}$$

$$m_{kl}(L) = \sum_{i=p_k - p_{kl} + 1}^{p_k} m_{kl,i} L^i, \quad \text{for } k \neq l, \tag{14.2.6}$$

and

$$\alpha_{kl}(L) = \alpha_{kl,0} - \sum_{i=1}^{p_k} \alpha_{kl,i} L^i, \quad \text{with } \alpha_{kl,0} = m_{kl,0} \text{ for } k, l = 1, \ldots, K. \tag{14.2.7}$$

Here

$$p_{kl} = \begin{cases} \min(p_k + 1, p_l) & \text{for } k \ge l, \\ \min(p_k, p_l) & \text{for } k < l, \end{cases} \qquad k, l = 1, \ldots, K.$$

The row degrees $p_k$ in this representation are again called Kronecker indices. In (14.2.1), $p = \max(p_1, \ldots, p_K)$, that is, p is the maximum row degree or Kronecker index. $ARMA_{RE}(p_1, \ldots, p_K)$ denotes a reverse echelon form with Kronecker indices $p_1, \ldots, p_K$. ■

It was argued by Poskitt (2004) that the initial conditions may contribute to a unique representation of an integrated VARMA process in such a way that [A(z) : M(z)] does not have to be left-coprime. In that case, the mapping from the set of operators [A(z) : M(z)] to the set of admissible transfer functions $\Phi(z)$ may not be one-to-one, whereas in the present formulation which requires left-coprimeness of [A(z) : M(z)], a one-to-one mapping may be obtained using the reverse echelon form restrictions.

To see the difference to the $ARMA_E$ form discussed in Section 12.1, consider a three-dimensional process with Kronecker indices $(p_1, p_2, p_3) = (1, 2, 1)$ as in (12.1.21)/(12.1.22). In this case,

$$[p_{kl}] = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 2 & 1 \end{bmatrix}.$$
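Hence, an $ARMA_{RE}(1, 2, 1)$ has the following form:

$$\begin{aligned}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \alpha_{32,0} & 1 \end{bmatrix} y_t
&= \begin{bmatrix} \alpha_{11,1} & \alpha_{12,1} & \alpha_{13,1} \\ \alpha_{21,1} & \alpha_{22,1} & \alpha_{23,1} \\ \alpha_{31,1} & \alpha_{32,1} & \alpha_{33,1} \end{bmatrix} y_{t-1}
+ \begin{bmatrix} 0 & 0 & 0 \\ \alpha_{21,2} & \alpha_{22,2} & \alpha_{23,2} \\ 0 & 0 & 0 \end{bmatrix} y_{t-2} \\
&\quad + \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \alpha_{32,0} & 1 \end{bmatrix} u_t
+ \begin{bmatrix} m_{11,1} & m_{12,1} & m_{13,1} \\ 0 & m_{22,1} & 0 \\ m_{31,1} & m_{32,1} & m_{33,1} \end{bmatrix} u_{t-1}
+ \begin{bmatrix} 0 & 0 & 0 \\ m_{21,2} & m_{22,2} & m_{23,2} \\ 0 & 0 & 0 \end{bmatrix} u_{t-2}.
\end{aligned} \tag{14.2.8}$$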


Clearly, in this representation the autoregressive operator is unrestricted except for the constraints imposed by the maximum row degrees or Kronecker indices and the zero order matrix ($A_0 = M_0$), whereas zero restrictions are placed on the moving average coefficient matrices attached to low lags of the $u_t$. For example, in (14.2.8), there are two zero restrictions on $M_1$. A comparison with the representation in (12.1.21)/(12.1.22) shows that the restrictions imposed on $A_1$ in (12.1.21) correspond to those imposed on $M_1$ in (14.2.8).
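The bookkeeping behind the zero restrictions can be automated. The following sketch computes the $p_{kl}$ of Definition 14.1 and, from (14.2.5) and (14.2.6), the lags at which each MA coefficient $m_{kl,i}$ is unrestricted. The function names `p_matrix` and `free_ma_lags` are hypothetical, and indices are counted from zero in the code.

```python
def p_matrix(p):
    """p_kl = min(p_k + 1, p_l) for k >= l and min(p_k, p_l) for k < l."""
    K = len(p)
    return [[min(p[k] + 1, p[l]) if k >= l else min(p[k], p[l])
             for l in range(K)] for k in range(K)]

def free_ma_lags(p):
    """free[k][l] lists the lags i at which m_{kl,i} is unrestricted.

    Diagonal elements: lags 1, ..., p_k (the lag 0 coefficient is fixed at 1).
    Off-diagonal elements: lags p_k - p_kl + 1, ..., p_k, see (14.2.6).
    """
    P = p_matrix(p)
    K = len(p)
    free = []
    for k in range(K):
        row = []
        for l in range(K):
            start = 1 if k == l else p[k] - P[k][l] + 1
            row.append(list(range(start, p[k] + 1)))
        free.append(row)
    return free


# Kronecker indices of the example (14.2.8).
indices = (1, 2, 1)
P = p_matrix(indices)
free = free_ma_lags(indices)
# Zero restrictions on M_1: elements for which lag 1 is not a free lag.
zeros_in_M1 = [(k + 1, l + 1) for k in range(3) for l in range(3)
               if 1 not in free[k][l]]
```

For the indices (1, 2, 1) the code reproduces the matrix $[p_{kl}]$ given above and finds that $m_{21,1}$ and $m_{23,1}$ are restricted to zero. The free lag 0 for the element in position (3, 2) corresponds to $\alpha_{32,0}$ in the zero order matrix.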


14.2.3  The  Error  Correction  Echelon  Form 

The EC form may be obtained from (14.2.1) by subtracting $A_0 y_{t-1}$ on both sides and rearranging terms, as for the VECM form of a VAR model in Section 6.3:

$$A_0 \Delta y_t = \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \cdots + \Gamma_{p-1} \Delta y_{t-p+1} + M_0 u_t + M_1 u_{t-1} + \cdots + M_p u_{t-p}, \tag{14.2.9}$$

where

$$\Pi = -(A_0 - A_1 - \cdots - A_p)$$

and

$$\Gamma_i = -(A_{i+1} + \cdots + A_p), \quad i = 1, \ldots, p - 1.$$

Again, $\Pi y_{t-1}$ is the error correction term and $r = \mathrm{rk}(\Pi)$ is the cointegrating rank of the system.
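The mapping from the levels coefficients to the EC form is purely mechanical, as the following sketch illustrates. The function name `ec_matrices` and the numerical values are assumptions made for the illustration; the example uses a three-dimensional system in which the first two Kronecker indices are zero, so that the first two rows of $A_1$ vanish.

```python
import numpy as np

def ec_matrices(A):
    """Map A = [A_0, A_1, ..., A_p] to (Pi, [Gamma_1, ..., Gamma_{p-1}]).

    Pi      = -(A_0 - A_1 - ... - A_p)
    Gamma_i = -(A_{i+1} + ... + A_p), i = 1, ..., p-1
    """
    A = [np.asarray(a, dtype=float) for a in A]
    p = len(A) - 1
    Pi = -(A[0] - sum(A[1:]))
    Gammas = [-sum(A[i + 1:]) for i in range(1, p)]
    return Pi, Gammas


# Illustrative values: A_0 = I_3, only the third row of A_1 is nonzero.
A0 = np.eye(3)
A1 = np.zeros((3, 3))
A1[2] = [0.2, -0.3, 1.0]
Pi, Gammas = ec_matrices([A0, A1])
rank_Pi = np.linalg.matrix_rank(Pi)
```

Here $\Pi$ has the rows (−1, 0, 0), (0, −1, 0), and (0.2, −0.3, 0), so its rank is 2. Because p = 1 in this example, no $\Gamma_i$ matrices appear.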

If the operators A(L) and M(L) satisfy the reverse echelon form restrictions, it is easily seen that the $\Gamma_i$ satisfy similar identifying constraints as the $A_i$. More precisely, $\Gamma_i$ obeys the same zero restrictions as $A_{i+1}$ for $i = 1, \ldots, p - 1$, because a zero restriction on an element $\alpha_{kl,i}$ of $A_i$ implies that the corresponding elements $\alpha_{kl,j}$ of $A_j$ are also zero for $j > i$. For the same reason, the zero restrictions on $\Pi$ are the same as those on $A_0 - A_1$. This means in particular that there are no echelon form zero restrictions on $\Pi$ if all Kronecker indices $p_k \ge 1$, $k = 1, \ldots, K$, because in that case the reverse echelon form does not impose zero restrictions on $A_1$. On the other hand, if some Kronecker indices are zero, this fact has implications for the integration and cointegration structure of the variables. A specific analysis of the relations between the variables is called for in that case. Denoting by g the number of Kronecker indices which are zero, it is not difficult to see that

$$\mathrm{rk}(\Pi) \ge g \tag{14.2.10}$$

(see Problem 14.1). This result has to be taken into account in the procedure for specifying the cointegrating rank of a VARMA system, as discussed in Section 14.4.

An EC model which satisfies the reverse echelon form restrictions will be called an EC-ARMA$_{RE}$ form in the following. As an example, consider again the system (14.2.8). Its EC-ARMA$_{RE}$ form is

As a further example, consider the three-dimensional ARMA$_{RE}(0, 0, 1)$ model

Its EC-ARMA$_{RE}$ form is

(14.2.11)

Obviously, the rank of


$$\Pi = \begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ \pi_{31} & \pi_{32} & \pi_{33} \end{bmatrix}$$

is at least 2 and, thus, the cointegrating rank in this case is also at least 2.
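The rank statement is easy to verify numerically. The sketch below uses made-up values for $\pi_{31}, \pi_{32}, \pi_{33}$ (they are not estimates from the text) and the hypothetical helper `pi_example`. Because the first two rows are linearly independent, the rank is 2 exactly when $\pi_{33} = 0$ and 3 otherwise.

```python
import numpy as np

def pi_example(pi31, pi32, pi33):
    """Pi matrix of the ARMA_RE(0, 0, 1) example with a free third row."""
    return np.array([[-1.0, 0.0, 0.0],
                     [0.0, -1.0, 0.0],
                     [pi31, pi32, pi33]])

# det(Pi) = pi33, so the rank drops to 2 only if pi33 = 0
rank_low = np.linalg.matrix_rank(pi_example(0.3, -0.2, 0.0))
rank_full = np.linalg.matrix_rank(pi_example(0.3, -0.2, 0.5))
```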

Specifying an EC-ARMA$_{RE}$ model requires that the cointegrating rank $r$ is determined, the Kronecker indices $p_1, \dots, p_K$ are obtained and possibly further overidentifying zero restrictions are placed on the coefficient matrices $\Gamma_j$ and $M_j$. Before we consider strategies for these tasks, we discuss the estimation of EC-ARMA$_{RE}$ models for given cointegrating rank and Kronecker indices in the next section.


14.3  Estimation 

14.3.1 Estimation of ARMA$_{RE}$ Models

For given Kronecker indices, an ARMA$_{RE}$ model can be estimated even if the cointegrating rank is unknown. Under Gaussian assumptions, ML estimation can be used. The estimators may be determined by maximizing a log-likelihood function as in (12.2.24),

$$\ln l_0(\gamma, \Sigma_u) = -\frac{T}{2} \ln |\Sigma_u| - \frac{1}{2} \sum_{t=1}^{T} u_t(\gamma)' \Sigma_u^{-1} u_t(\gamma), \tag{14.3.1}$$

where an additive constant is dropped and zero initial conditions are assumed so that

$$u_t(\gamma) = y_t - \sum_{i=1}^{t-1} \Pi_i(\gamma) y_{t-i}.$$

If the initial values are nonzero, $\ln l_0$ is just an approximate log-likelihood. Here $\gamma$ contains all unrestricted autoregressive and moving average parameters, as in Section 12.2, and maximization may proceed by an iterative procedure, as in Section 12.3. Starting values are required for such an algorithm. The preliminary estimator presented in Chapter 12, Section 12.3.4, can be used for that purpose (e.g., Poskitt (2003)).
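To make the objective function concrete, the following sketch evaluates the log-likelihood (14.3.1) for given coefficient matrices. It is an illustration under simplifying assumptions, not the estimation algorithm of Section 12.3: the zero order matrices are taken to be $A_0 = M_0 = I_K$, the residuals are computed recursively with zero initial conditions (which is equivalent to using the truncated pure VAR representation), and the function names `varma_residuals` and `log_lik0` are hypothetical.

```python
import numpy as np

def varma_residuals(y, A, M):
    """Residuals u_t of y_t = sum_i A_i y_{t-i} + u_t + sum_j M_j u_{t-j}.

    y has shape (T, K); A = [A_1, ..., A_p], M = [M_1, ..., M_q].
    Assumes A_0 = M_0 = I_K and zero initial conditions.
    """
    T, K = y.shape
    u = np.zeros((T, K))
    for t in range(T):
        e = y[t].astype(float)          # astype returns a copy
        for i, Ai in enumerate(A, start=1):
            if t - i >= 0:
                e -= Ai @ y[t - i]
        for j, Mj in enumerate(M, start=1):
            if t - j >= 0:
                e -= Mj @ u[t - j]
        u[t] = e
    return u

def log_lik0(u, Sigma_u):
    """ln l_0 = -(T/2) ln|Sigma_u| - (1/2) sum_t u_t' Sigma_u^{-1} u_t."""
    T = u.shape[0]
    _, logdet = np.linalg.slogdet(Sigma_u)
    quad = np.einsum("ti,ij,tj->", u, np.linalg.inv(Sigma_u), u)
    return -0.5 * T * logdet - 0.5 * quad
```

In an actual ML routine, the entries of `A` and `M` would be functions of the free parameter vector $\gamma$, and `log_lik0` would be passed to a numerical optimizer.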

As in the case of a cointegrated VAR model, the ML estimators have asymptotic properties which are in some respects different from those obtained in the stationary case. Roughly speaking, they are the same as those that would be obtained if the true cointegration matrix were known. Thus, if $0 < r < K$, generally the ML estimator $\tilde\gamma$ is consistent and

$$\sqrt{T}(\tilde\gamma - \gamma) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\tilde\gamma}),$$

where the covariance matrix is singular. These results follow from Yap & Reinsel (1995) and also hold under suitable alternative conditions if $y_t$ is not Gaussian.

If the cointegrating rank is known, it is often desirable to estimate the EC-ARMA$_{RE}$ form of the process because it also provides estimates of the cointegration relations which may well be of major interest. Therefore, estimation of these models will be considered next.

14.3.2 Estimation of EC-ARMA$_{RE}$ Models

If identifying restrictions are imposed on the cointegration matrix, then estimation of the EC-ARMA$_{RE}$ form can also be done by Gaussian ML based on a log-likelihood function similar to (14.3.1), where $\gamma$ now contains the free parameters of the EC-ARMA$_{RE}$ form. An alternative approach would be to estimate the cointegration matrix $\beta$ first by reduced rank regression or an EGLS procedure based on a long VAR($n$) model, as in Section 7.2. The properties of this estimator will be discussed further in Chapter 15, where fitting approximate VAR models is discussed. For the present purposes, it is sufficient to note that this estimator, say $\hat\beta$, may be used in an ML procedure which estimates the other parameters by maximizing the log-likelihood function conditionally on $\hat\beta$. In other words, the cointegration parameters are fixed at the first stage estimator $\hat\beta$ of the cointegration matrix $\beta$ and then the log-likelihood is maximized with respect to the other parameters. The resulting estimators have the same asymptotic properties as the full ML estimators (see Yap & Reinsel (1995)).

Starting values for the other parameters, which may be used to initialize an iterative procedure for maximizing the log-likelihood function, may be determined in an analogous way as in Section 12.3.4. The short-run and loading parameter estimators have an asymptotic normal distribution which is the same as if the cointegration matrix $\beta$ were known. This result, of course, is analogous to the pure VAR case considered in Chapter 7 (see also Phillips (1991, Remark (n)) and Yap & Reinsel (1995)).

The previous discussion assumes given Kronecker indices and possibly a known cointegrating rank. Statistical procedures for specifying these quantities will be discussed next.


14.4 Specification of EC-ARMA$_{RE}$ Models

14.4.1  Specification  of  Kronecker  Indices 

For stationary processes, proposals for specifying the Kronecker indices of an ARMA$_E$ model were discussed in Section 13.3. The strategies for specifying the Kronecker indices of cointegrated ARMA$_{RE}$ forms presented in this section were proposed by Poskitt & Lütkepohl (1995) and Poskitt (2003). In the latter article, it is also argued that they result in consistent estimators of the Kronecker indices under suitable conditions. In a simulation study, Bartel & Lütkepohl (1998) found that they worked reasonably well in small samples, at least for the processes explored in their Monte Carlo study.

The specification procedures may be partitioned into two stages. The first stage is the same as for the procedures for stationary processes discussed in Section 13.3.2 and consists of fitting a long autoregression by least squares in order to obtain estimates of the unobservable innovations $u_t$, $t = 1, \dots, T$.

STAGE I: Use multivariate LS estimation to fit a long VAR($n$) process to the data to obtain residuals $\hat u_t(n)$. ■

These residuals are then substituted for the unknown lagged $u_t$'s in the individual equations of an ARMA$_{RE}$ form which may then be estimated by linear LS procedures. Based on the equations estimated in this way, a choice of the Kronecker indices is made using model selection criteria. Poskitt & Lütkepohl (1995), Guo, Huang & Hannan (1990), and Huang & Guo (1990) showed that the estimated residuals $\hat u_t(n)$ are "good" estimates of the true residuals if $n$ approaches infinity at a suitable rate, as $T$ goes to infinity (see Lemma 3.1 of Poskitt & Lütkepohl (1995) for details).

The methods presented in the following differ in the way they choose the Kronecker indices in the next step. An obvious idea may be to search over all models associated with Kronecker indices

$$\{(p_1, \dots, p_K) \mid 0 \leq p_k \leq p_{\max},\ k = 1, \dots, K\}$$

for some prespecified upper bound $p_{\max}$ and choose the set of Kronecker indices which optimizes some model selection criterion, as in Section 13.3.2 for the stationary case. The two procedures presented in the following are more efficient computationally and they are similar to Poskitt's procedure presented in Section 13.3.4. The first variant uses linear regressions to estimate the individual equations separately for different lag lengths. A choice of the optimal lag length is then based on some prespecified criterion similar to those considered for the stationary case. The following formal description of the procedure is taken from Poskitt & Lütkepohl (1995).

STAGE II: Proceed in the following steps.

(ia) For $m = 0$, set $T\hat\sigma_k^2(m)$ equal to the residual sum of squares from the regression of $y_{kt}$ on a constant and $(y_{jt} - \hat u_{jt}(n))$, $j = 1, \dots, K$, $j \neq k$. For $m = 1, \dots, p_{\max} < n$, regress $y_{kt}$ on a constant, $(y_{jt} - \hat u_{jt}(n))$, $j = 1, \dots, K$, $j \neq k$, and $y_{t-s}$ and $\hat u_{t-s}(n)$, $s = 1, \dots, m$, and determine the residual sums of squares, $T\hat\sigma_k^2(m)$, for $k = 1, \dots, K$.

(ib) For $k = 1, \dots, K$, compute a selection criterion of the form

$$Cr_k(m) = \ln \hat\sigma_k^2(m) + c_T m / T, \quad m = 0, 1, \dots, p_{\max},$$

where $c_T$ is a function of $T$ which will be specified later.

(ii) Set the estimate of the $k$-th Kronecker index equal to

$$\hat p_k = \arg\min_{0 \leq m \leq p_{\max}} Cr_k(m), \quad k = 1, \dots, K.$$


In the regressions in Step (ia), restrictions from the echelon structure are not explicitly taken into account, because for each value of $m$, the algorithm implicitly assumes that the current index under consideration is the smallest and, thus, no restrictions are imported from other equations. Still, the $k$-th equation will be misspecified whenever $m$ is less than the true Kronecker index because in that case, lagged values required for a correct specification are omitted. On the other hand, if $m$ is greater than the true Kronecker index, the $k$-th equation will be correctly specified but may include redundant parameters and variables. Therefore, it is intuitively plausible that for an appropriate choice of $c_T$, the criterion function $Cr_k(m)$ will be minimized asymptotically when $m$ is equal to the true Kronecker index. For practical purposes, possible choices of $c_T$ are $c_T = n \ln T$ or $c_T = n^2$.

At Stage II, values for $n$, $p_{\max}$, and $c_T$ have to be chosen. The theoretical consistency results stated in Poskitt (2003) are quite general and provide an asymptotic justification for many different values of these quantities. The following choices may be considered in practice:

• Choose $n$ by AIC or use $n = \max\{(\ln T)^a, \hat p(\mathrm{AIC})\}$, where $a > 1$.

• Choose $p_{\max} = \frac{1}{2} n$.

• Choose $c_T = n \ln T$ or $c_T = n^2$.
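The two stages can be sketched in a few lines of NumPy. This is a simplified illustration of the first variant only, not the implementation of Poskitt & Lütkepohl (1995): the function names `fit_long_var` and `select_kronecker_indices` are hypothetical, an intercept is included in the long autoregression, and all regressions for different $m$ use the same effective sample so that the residual variances are comparable.

```python
import numpy as np

def fit_long_var(y, n):
    """Stage I: LS fit of a VAR(n) with intercept; y has shape (T, K).

    Returns the residuals for t = n, ..., T-1 as a (T - n, K) array.
    """
    T, K = y.shape
    lags = [y[n - s:T - s] for s in range(1, n + 1)]
    X = np.hstack([np.ones((T - n, 1))] + lags)
    B, *_ = np.linalg.lstsq(X, y[n:], rcond=None)
    return y[n:] - X @ B

def select_kronecker_indices(y, n, p_max, c_T):
    """Stage II (first variant): equation-by-equation choice of p_k."""
    u = fit_long_var(y, n)             # estimated innovations u_t(n)
    yy = y[n:]                         # observations aligned with u
    rows = np.arange(p_max, len(u))    # common sample for all m
    T_eff, K = len(rows), y.shape[1]
    p_hat = np.zeros(K, dtype=int)
    for k in range(K):
        others = [j for j in range(K) if j != k]
        crit = []
        for m in range(p_max + 1):
            # constant and (y_jt - u_jt(n)), j != k
            cols = [np.ones((T_eff, 1)),
                    yy[rows][:, others] - u[rows][:, others]]
            # lagged y and lagged residuals, s = 1, ..., m
            for s in range(1, m + 1):
                cols += [yy[rows - s], u[rows - s]]
            X = np.hstack(cols)
            b, *_ = np.linalg.lstsq(X, yy[rows, k], rcond=None)
            resid = yy[rows, k] - X @ b
            sigma2 = resid @ resid / T_eff
            crit.append(np.log(sigma2) + c_T * m / T_eff)
        p_hat[k] = int(np.argmin(crit))
    return p_hat
```

With a data matrix `y` of shape `(T, K)`, a call such as `select_kronecker_indices(y, n=6, p_max=3, c_T=6 * np.log(len(y)))` corresponds to the choices $p_{\max} = n/2$ and $c_T = n \ln T$ listed above.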

Poskitt & Lütkepohl (1995) also proposed a modification of Stage II which makes it possible to take into account coefficient restrictions derived from those equations in the system which have smaller Kronecker indices. In that modification, after running through Stage II for the first time, we fix the smallest Kronecker index and repeat Stage II, but search only those equations which are found to have indices larger than the smallest. In this second application of Stage II, the restrictions implied by the smallest Kronecker index found in the first round are taken into account when the second smallest index is determined. We proceed in this way by fixing the smallest Kronecker index found in each successive round until all the Kronecker indices have been specified. In this procedure, the variables are ordered in such a way that the Kronecker indices of the final system are ordered from largest to smallest. That is, the variable whose equation is associated with the smallest Kronecker index is placed last in the list of variables. The one with the second smallest Kronecker index is assigned the next to the last place and so on. For details, see Poskitt & Lütkepohl (1995) and Poskitt (2003).

It should be understood that the Kronecker indices found in such a procedure for a given time series of finite length can only be expected to be a reasonable starting point for a more refined analysis of the system under consideration. Based on the specified Kronecker indices, a more efficient procedure for estimating the parameters may be applied and the model may be modified subsequently.

So far, we have not discussed the choice of the cointegrating rank. In practice, of course, this quantity is unlikely to be known. Comments on the estimation of $r$ will be given in the following subsection.

14.4.2  Specification  of  the  Cointegrating  Rank 

Saikkonen & Luukkonen (1997) and Lütkepohl & Saikkonen (1999b) showed that Johansen's LR tests for the cointegrating rank (see Section 8.2) maintain their asymptotic properties even if a finite order VAR process is fitted although the true underlying process has an infinite order VAR structure. Consequently, these tests may be applied at Stage I of the present specification procedure. The cointegrating rank is then determined independently of the Kronecker indices. Alternatively, Yap & Reinsel (1995) extended the likelihood ratio principle to VARMA processes and developed cointegration rank tests under the assumption that identified versions of $A(z)$ and $M(z)$ are used. Thus, these tests may be applied once the Kronecker indices have been specified. Whatever approach is adopted, for our purposes the following modification is noteworthy.

If a Kronecker index $p_k = 0$, the variable $y_{kt}$ inherits all of its dynamics from other variables in the system and it is known from (14.2.10) that the cointegrating rank $r \geq g$, the number of zero Kronecker indices. Hence, the testing procedure proceeds by considering only null hypotheses where $r$ is greater than or equal to $g$. In other words, the following sequence of null hypotheses is tested: $H_0: r = g$, $H_0: r = g + 1$, ..., $H_0: r = K - 1$. The estimator of $r$ is chosen such that it is the smallest value for which $H_0$ cannot be rejected.
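The sequential decision rule can be written down directly. The sketch below assumes that test statistics and critical values for the hypotheses $H_0: r = g, \dots, H_0: r = K - 1$ have already been computed elsewhere (for instance with Johansen type LR tests); the function name `select_coint_rank` and the numbers in the example call are hypothetical.

```python
def select_coint_rank(lr_stats, crit_values, g, K):
    """Smallest r in {g, ..., K-1} whose null hypothesis is not rejected.

    lr_stats[i] and crit_values[i] belong to H0: r = g + i.
    Returns K if every null hypothesis is rejected.
    """
    if len(lr_stats) != K - g or len(crit_values) != K - g:
        raise ValueError("need one statistic and critical value per H0")
    for i, (stat, cv) in enumerate(zip(lr_stats, crit_values)):
        if stat <= cv:          # H0: r = g + i cannot be rejected
            return g + i
    return K

# Made-up numbers: K = 3 variables, g = 1 zero Kronecker index
r_hat = select_coint_rank([50.0, 10.0], [30.0, 15.0], g=1, K=3)
```

Starting the sequence at $r = g$ rather than $r = 0$ is the only difference from the usual testing sequence.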

Once a model has been estimated, some checks for model adequacy are in order and possible further model reductions or modifications may be called for. For instance, insignificant parameter estimates may be restricted to zero. Here it is convenient that the $t$-ratios of the short-run parameters have their usual asymptotic standard normal distributions under the null hypothesis, due to the asymptotic normal distribution of the ML estimators. Thus, they can be used for significance tests in the usual way and may help to place overidentifying restrictions on the parameters. Moreover, a detailed analysis of the residual properties should be performed to reveal possible model deficiencies. The checks for model adequacy described in Chapters 4 and 8 can be used here as well with appropriate modifications.


14.5  Forecasting  Cointegrated  VARMA  Processes 

Forecasting cointegrated VARMA processes proceeds completely analogously to forecasting stationary VARMA processes. The same formulas can be used. As for pure VAR models, the properties of the forecasts will be different, however. In particular, the forecast error covariance matrices will be unbounded for increasing forecast horizon. Hence, forecast intervals will also be unbounded in length. In this respect, the properties of the forecasts are analogous to those of cointegrated pure VAR processes. The reader is referred to Section 6.5 for details.
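The unbounded forecast error covariance can be illustrated with a small numerical sketch. For simplicity it uses a pure VAR(1) with known coefficients rather than a VARMA model, so that the MA coefficients are $\Phi_i = A_1^i$ and the $h$-step forecast MSE matrix is $\sum_{i=0}^{h-1} \Phi_i \Sigma_u \Phi_i'$. The function name `forecast_mse` and the coefficient matrices are made up for illustration.

```python
import numpy as np

def forecast_mse(A1, Sigma_u, h):
    """h-step forecast MSE matrix of y_t = A1 y_{t-1} + u_t."""
    K = A1.shape[0]
    Phi = np.eye(K)                 # Phi_0 = I_K
    mse = np.zeros((K, K))
    for _ in range(h):
        mse += Phi @ Sigma_u @ Phi.T
        Phi = A1 @ Phi              # Phi_{i+1} = A1 Phi_i
    return mse

Sigma_u = np.eye(2)
# Cointegrated system: y_1 is a random walk, y_2 - y_1 is stationary
A_coint = np.array([[1.0, 0.0], [0.5, 0.5]])
# Stable system for comparison
A_stable = 0.5 * np.eye(2)

mse_coint = [forecast_mse(A_coint, Sigma_u, h)[0, 0] for h in (10, 100)]
mse_stable = [forecast_mse(A_stable, Sigma_u, h)[0, 0] for h in (10, 100)]
```

In the cointegrated system the forecast MSE of the first variable grows linearly with the horizon, whereas in the stable system it converges to the variance of the process.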
14.6 An Example

For illustrative purposes, we use an example from Lütkepohl & Claessen (1997), based on the U.S. macroeconomic data which were also considered in Section 7.4.3. The data are available in File E3. It consists of 136 quarterly observations for the years 1954.1 to 1983.4 of the real money stock M1 ($y_{1t}$), GNP in billions of 1982 dollars ($y_{2t}$), the discount interest rate on new issues of 91-day treasury bills ($y_{3t}$), and the yield on long term (20 years) treasury bonds ($y_{4t}$). Logarithms of seasonally adjusted GNP and M1 data are used. Thus, $y_t$ is a four-dimensional vector. Notice that we do not use the full sample period covered in File E3 but truncate the data for the last four years. The reason is that in the exercises readers are asked to perform a forecast comparison based on the model presented in the following. The data for the years 1984-1987 are set aside for this comparison.

Following the procedure outlined in Section 14.4.2, the cointegrating rank may be determined with LR type tests applied to a long VAR model. After running through an extensive specification procedure, Lütkepohl & Claessen (1997) finally specified an EC-ARMA$_{RE}(2, 1, 1, 1)$ model with cointegrating rank 1 for the data generation process of this system and obtained the following estimated model

Estimated standard errors are given in parentheses. The cointegration vector $\hat\beta' = [1, -.343, -16.72, 19.35]$ was obtained by estimating a VECM with one lagged difference of $y_t$ and with cointegrating rank 1, using the ML procedure presented in Section 7.2.3 and normalizing the first element of $\hat\beta$ to be 1.

Some  of  the  parameter  values  in  (14.6.1)  are  quite  small  compared  to  their 
estimated  standard  errors.  In  particular,  some  of  them  are  not  significant 
under  a  two-standard  error  criterion.  Therefore,  zero  restrictions  were  placed 
on  the  coefficients  and  the  following  estimated  model  was  obtained: 


1 

-.476 

(.107) 

0 

.145 

(.032) 


0  0  cr 

10  0 


0  10 
0  0  1 


Ayt 
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.094 

-.042 

(.035) 

(.016) 

.219 

-.096 

(.056) 

1 

(.026) 

.207 

1 

-.094 

(.057) 

(.026) 

.069 

-.031 

(.020) 

(.009) 

0 

.331 

(.104) 

.162 

(.042) 


[1, -.343, -16.72, 19.35]yt_i 


.772 

(.063) 

0 

0 

0 

1 

-.476 

(.107) 

0 

.145 

(.032) 


.087 

(.067) 

0 

0 

0 

0  0  0 
1  0  0 

0  10 
0  0  1 


.788 

(.100) 

0 

0 

0 


ut 


.198 

(.195) 

0 

0 

0 


Ayt- i 


-.640  0  0 

(.082) 


0 


0 


0  .339 

(.083) 

0  .110 

(.054) 


.107  0  .233 

(.081)  (.114) 

0  0  0 

0  0  0 

0  0  0 


1.105 

(.297) 

0 

-.323 

(.095) 

.984 

(.196) 

0 
0 
0 


Ut- 1 


Ut- 2- 


(14.6.2) 


This  example  is  just  meant  to  illustrate  that  the  procedures  presented 
in  this  chapter  are  indeed  feasible  in  practice.  The  reader  is  encouraged  to 
perform  a  forecast  comparison  of  the  model  presented  here  with  pure  VAR 
models  and  VECMs  for  the  data  (see  Problem  14.6). 


14.7  Exercises 

14.7.1  Algebraic  Exercises 

Problem 14.1

Show that in the model (14.2.9), $\mathrm{rk}(\Pi) \geq g$, where $g$ is the number of Kronecker indices which are zero. (Hint: Consider the matrix $A_0 - A_1$.)

Problem 14.2

Write down an ARMA$_{RE}(2, 1, 2)$ model explicitly in matrix form and also write down the corresponding EC-ARMA$_{RE}$ form.


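Problem 14.3

Consider the following EC-ARMA$_{RE}$ model:

$$\begin{bmatrix} 1 & 0 \\ \alpha_{21,0} & 1 \end{bmatrix} \Delta y_t = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix} y_{t-1} + \begin{bmatrix} \gamma_{11,1} & \gamma_{12,1} \\ 0 & 0 \end{bmatrix} \Delta y_{t-1} + \begin{bmatrix} 1 & 0 \\ \alpha_{21,0} & 1 \end{bmatrix} u_t + \begin{bmatrix} m_{11,1} & 0 \\ m_{21,1} & m_{22,1} \end{bmatrix} u_{t-1}.$$

(a) Write the model in ARMA$_{RE}$ form.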

(b)  Specify  the  Kronecker  indices. 

(c)  How  many  over-identifying  restrictions  are  present  in  this  model? 

(d)  Write  the  model  in  pure  VAR  form. 


14.7.2  Numerical  Exercises 

The  following  problems  are  based  on  the  U.S.  data  given  in  File  E3,  as  de¬ 
scribed  in  Section  14.6.  The  variables  are  defined  in  the  same  way  as  in  that 
section.  Thus,  a  system  of  dimension  four  is  considered. 

Problem 14.4

Fit  a  pure  VAR  model  to  the  four-dimensional  data  set  without  considering 
integration  and  cointegration  properties  of  the  variables.  Use  only  the  data 
for  1954.1-1983.4  for  modelling  and  estimation.  Compute  forecasts  from  the 
model  for  the  period  1984.1-1987.4. 

Problem 14.5

Use  the  following  steps  in  constructing  VECMs  for  the  period  1954.1-1983.4 
and  computing  forecasts  of  the  four  variables  for  the  period  1984.1-1987.4. 

(a)  Determine  the  cointegrating  rank  of  the  system. 

(b) Estimate the cointegration relation(s) with the reduced rank ML and the EGLS methods discussed in Chapter 7.

(c)  Construct  subset  VECMs  based  on  the  estimated  cointegration  relations 
from  the  previous  step. 

(d)  Confirm  that  the  models  obtained  in  the  previous  steps  are  adequate  rep¬ 
resentations  of  the  data  generation  process. 

(e)  Compute  forecasts  from  your  model  for  the  period  1984.1-1987.4. 

Problem 14.6

Compare the forecasts obtained in Problems 14.4 and 14.5 with those from the EC-ARMA$_{RE}$ model (14.6.2) on the basis of the MSEs. Discuss the results. (Hint: See Lütkepohl & Claessen (1997).)


Fitting Finite Order VAR Models to Infinite Order Processes


15.1  Background 

In  the  previous  chapters,  we  have  derived  properties  of  models,  estimators, 
forecasts,  and  test  statistics  under  the  assumption  of  a  true  model.  We  have 
also  argued  that  such  an  assumption  is  virtually  never  fulfilled  in  practice.  In 
other  words,  in  practice,  all  we  can  hope  for  is  a  model  that  provides  a  useful 
approximation  to  the  actual  data  generation  process  of  a  given  multiple  time 
series.  In  this  chapter,  we  will,  to  some  extent,  take  into  account  this  state  of 
affairs  and  assume  that  an  approximating  rather  than  a  true  model  is  fitted. 
Specifically,  we  assume  that  the  true  data  generation  process  is  an  infinite 
order  VAR  process  and,  for  a  given  sample  size  T,  a  finite  order  VAR(p)  is 
fitted  to  the  data. 

In  practice,  it  is  likely  that  a  higher  order  VAR  model  is  considered  if  the 
sample  size  or  time  series  length  is  larger.  In  other  words,  the  order  p  increases 
with  the  sample  size  T.  If  an  order  selection  criterion  is  used  in  choosing  the 
VAR  order,  the  maximum  order  to  be  considered  is  likely  to  depend  on  T . 
This  again  implies  that  the  actual  order  chosen  depends  on  the  sample  size 
because  it  will  depend  on  the  maximum  order.  In  summary,  the  actual  order 
selected  may  be  regarded  as  a  function  of  the  sample  size  T.  In  order  to  derive 
statistical  properties  of  estimators  and  forecasts,  we  will  make  this  assumption 
in  the  following.  More  precisely,  we  will  assume  that  the  VAR  order  goes  to 
infinity  with  the  sample  size.  Under  that  assumption,  an  asymptotic  theory 
has  been  developed  that  will  be  discussed  in  this  chapter. 

In Section 15.2, the assumptions for the underlying true process and for the order of the process fitted to the data are specified in detail and asymptotic estimation results are provided for stable processes. In Section 15.3, the consequences for forecasting are discussed and impulse response analysis is considered in Section 15.4. Our standard investment/income/consumption example is used to contrast the present approach to that considered in Chapter 3, where a true finite order process is assumed. Finally, in Section 15.5, extensions to cointegrated processes are discussed.


15.2 Multivariate Least Squares Estimation

Suppose the generation process of a given multiple time series is a stationary, stable, $K$-dimensional, infinite order VAR process,

$$y_t = \sum_{i=1}^{\infty} \Pi_i y_{t-i} + u_t, \tag{15.2.1}$$

with absolutely summable $\Pi_i$, that is,

$$\sum_{i=1}^{\infty} \|\Pi_i\| < \infty \tag{15.2.2}$$

(see Appendix C.3) and canonical MA representation

$$y_t = \sum_{i=0}^{\infty} \Phi_i u_{t-i}, \quad \Phi_0 = I_K, \tag{15.2.3}$$

satisfying

$$\det\left( \sum_{i=0}^{\infty} \Phi_i z^i \right) \neq 0 \ \text{ for } |z| \leq 1 \quad \text{and} \quad \sum_{i=1}^{\infty} i^{1/2} \|\Phi_i\| < \infty. \tag{15.2.4}$$

The zero mean assumption implied by these conditions is not essential and is imposed for convenience only. Stable, invertible VARMA processes satisfy the foregoing conditions. The assumptions allow for more general processes, however. Of course, the generation process may also be a stable, finite order VAR($p$) in which case $\Pi_i = 0$ for $i > p$.

We  have  argued  in  the  previous  section  that  in  practice  the  true  structure 
will  usually  be  unknown  and  the  investigator  may  consider  fitting  a  finite 
order  VAR  process  with  the  VAR  order  depending  on  the  length  T  of  the 
available  time  series.  For  this  situation,  Lewis  &  Reinsel  (1985)  have  shown 
consistency  and  asymptotic  normality  of  the  multivariate  LS  estimators.  For 
univariate  processes,  similar  results  were  discussed  earlier  by  Berk  (1974)  and 
Bhansali  (1978). 

To state these results formally, we use the following notation:

$$\Pi(n) := [\Pi_1, \dots, \Pi_n],$$

$$\pi(n) := \mathrm{vec}\, \Pi(n).$$

Fitting a VAR($n$) process, the $i$-th estimated coefficient matrix is denoted by $\hat\Pi_i(n)$,

$$\hat\Pi(n) := [\hat\Pi_1(n), \dots, \hat\Pi_n(n)],$$

and

$$\hat\pi(n) := \mathrm{vec}\, \hat\Pi(n).$$

Now we can state a result of Lewis & Reinsel (1985).

Proposition 15.1 (Properties of the LS Estimator of an Approximating VAR Model)

Let the multiple time series $y_1, \dots, y_T$ be generated by a potentially infinite order VAR process satisfying (15.2.1)-(15.2.4) with standard white noise $u_t$. Suppose finite order VAR($n_T$) processes are fitted by multivariate LS and assume that the order $n_T$ depends upon the sample size $T$ such that

$$n_T \to \infty, \quad n_T^3/T \to 0, \quad \text{and} \quad \sqrt{T} \sum_{i=n_T+1}^{\infty} \|\Pi_i\| \to 0 \quad \text{as } T \to \infty. \tag{15.2.5}$$

Furthermore, let $c_1$, $c_2$ be positive constants and $f(n)$ a sequence of $(K^2 n \times 1)$ vectors such that

$$0 < c_1 \leq f(n)' f(n) \leq c_2 < \infty \quad \text{for } n = 1, 2, \dots$$

Then

$$\frac{\sqrt{T - n_T}\, f(n_T)' [\hat\pi(n_T) - \pi(n_T)]}{[f(n_T)' (\Gamma_{n_T}^{-1} \otimes \Sigma_u) f(n_T)]^{1/2}} \xrightarrow{d} \mathcal{N}(0, 1), \tag{15.2.6}$$

where

$$\Gamma_n := E\left( \begin{bmatrix} y_t \\ \vdots \\ y_{t-n+1} \end{bmatrix} [y_t', \dots, y_{t-n+1}'] \right). \tag{15.2.7}$$

Remark 1 The assumption (15.2.5) means that, although the VAR order has to go to infinity with the sample size, it has to do so at a much slower rate because $n_T^3/T \to 0$. The requirement

$$\sqrt{T} \sum_{i=n_T+1}^{\infty} \|\Pi_i\| \to 0 \tag{15.2.8}$$

is always satisfied if $y_t$ is actually a finite order VAR process and $n_T \to \infty$. For infinite order VAR processes, this condition implies a lower bound for the rate at which $n_T$ goes to infinity. To see this, consider the univariate MA(1) process

$$y_t = u_t - m u_{t-1},$$

where $0 < |m| < 1$ to ensure invertibility. Its AR representation is

$$y_t = -\sum_{i=1}^{\infty} m^i y_{t-i} + u_t$$

and condition (15.2.8) becomes

$$\sqrt{T} \sum_{i=n+1}^{\infty} |m|^i = \sqrt{T}\, |m|^{n+1} \sum_{i=0}^{\infty} |m|^i = \sqrt{T}\, \frac{|m|^{n+1}}{1 - |m|} \xrightarrow[T \to \infty]{} 0. \tag{15.2.9}$$

Here the subscript $T$ has been dropped from $n_T$ for notational simplicity. In this example, $n_T = T^{1/\epsilon}$ with $\epsilon > 3$ is a possible choice for the sequence $n_T$ that satisfies both (15.2.9) and $n_T^3/T \to 0$. On the other hand, $n_T = \ln \ln T$ is not a permissible choice because in this case

$$\sqrt{T}\, |m|^{n_T + 1}$$

does not approach zero as $T \to \infty$. This result is easily established by considering the logarithm of (15.2.9),

$$\tfrac{1}{2} \ln T + (n_T + 1) \ln |m| - \ln(1 - |m|),$$

which goes to infinity for $n_T = \ln \ln T$.

In summary, (15.2.8) is a lower bound and $n_T^3/T \to 0$ establishes an upper bound for the rate at which $n_T$ has to go to infinity with the sample size $T$. ■

Remark 2 Proposition 15.1 implies that for fixed $m$,

$$\sqrt{T - n_T}\, \mathrm{vec}\left( [\hat\Pi_1(n_T), \dots, \hat\Pi_m(n_T)] - [\Pi_1, \dots, \Pi_m] \right)$$

has an asymptotic multivariate normal distribution with mean zero and covariance matrix $V \otimes \Sigma_u$, where $V$ is obtained as follows: Let $V_n$ be the upper left-hand $(Km \times Km)$ block of the inverse of $\Gamma_n$ for $n \geq m$. Then $V = \lim_{n \to \infty} V_n$. Loosely speaking, $V$ is the upper left-hand $(Km \times Km)$ block of the inverse of the infinite order matrix

$$E\left( \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \end{bmatrix} [y_t', y_{t-1}', \dots] \right).$$

Thus, the result can be used for inference on a finite number of parameters. It is also possible, however, to use the result from Proposition 15.1 to construct tests for hypotheses involving an infinite number of restrictions. Such hypotheses can arise in studying Granger-causality in infinite order VAR processes. This case was considered explicitly by Lütkepohl & Poskitt (1996).
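The rate conditions discussed in Remark 1 for the MA(1) example can be checked numerically. The sketch below evaluates the expression in (15.2.9) for two order sequences, $n_T = T^{1/4}$ and $n_T = \ln \ln T$. The value $m = 0.5$, the sample sizes, and the function name `tail_bound` are arbitrary illustrative choices, and the orders are treated as real numbers rather than rounded to integers.

```python
import math

def tail_bound(T, n, m):
    """sqrt(T) * |m|^(n+1) / (1 - |m|), the expression in (15.2.9)."""
    return math.sqrt(T) * abs(m) ** (n + 1) / (1 - abs(m))

m = 0.5
sample_sizes = [1e2, 1e4, 1e6]

# n_T = T^(1/4): satisfies n_T^3 / T -> 0 and drives (15.2.9) to zero
power_rate = [tail_bound(T, T ** 0.25, m) for T in sample_sizes]

# n_T = ln ln T: grows too slowly, the expression diverges
loglog_rate = [tail_bound(T, math.log(math.log(T)), m) for T in sample_sizes]
```

The first sequence of values decreases towards zero as $T$ grows, while the second increases, in line with the argument based on the logarithm of (15.2.9).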


Remark 3 If the data generation process has nonzero mean originally, the sample mean $\bar y$ may be subtracted initially from the data. It is asymptotically independent of the $\hat\Pi_i(n_T)$ and has an asymptotic normal distribution,

$$\sqrt{T}(\bar y - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\bar y}),$$

where

$$\Sigma_{\bar y} = \left( \sum_{i=0}^{\infty} \Phi_i \right) \Sigma_u \left( \sum_{i=0}^{\infty} \Phi_i \right)'.$$


A corresponding result from Lütkepohl & Poskitt (1991) for the white noise covariance matrix is stated next.

Proposition 15.2 (Asymptotic Properties of the White Noise Covariance Matrix Estimator)

Let

$$\hat u_t(n) := y_t - \sum_{i=1}^{n} \hat\Pi_i(n) y_{t-i}, \quad t = 1, \dots, T,$$

be the multivariate LS residuals from a VAR($n$) model fitted to a multiple time series of length $T$, let

$$\hat\Sigma_u(n) := \frac{1}{T} \sum_{t=1}^{T} \hat u_t(n) \hat u_t(n)'$$

be the corresponding estimator of the white noise covariance matrix and let $U := [u_1, \dots, u_T]$ so that

$$\frac{1}{T} U U' = \frac{1}{T} \sum_{t=1}^{T} u_t u_t'$$

is an estimator of $\Sigma_u$ based on the true white noise process $u_t$. Then, under the conditions of Proposition 15.1,

$$\mathrm{plim}\, \sqrt{T} \left( \hat\Sigma_u(n_T) - T^{-1} U U' \right) = 0.$$


We know from Chapter 3, Propositions 3.2 and 3.4, that, for a Gaussian
process, $T^{-1}UU'$ has an asymptotic normal distribution,

$$\sqrt{T}\,\mathrm{vech}(T^{-1}UU' - \Sigma_u) \xrightarrow{d} \mathcal N\big(0,\ 2D_K^{+}(\Sigma_u\otimes\Sigma_u)D_K^{+\prime}\big), \tag{15.2.10}$$

where, as usual, $D_K^{+} = (D_K'D_K)^{-1}D_K'$ is the Moore-Penrose inverse of the
$(K^2 \times \tfrac{1}{2}K(K+1))$ duplication matrix $D_K$. Using Proposition C.2(2) of
Appendix C.1, Proposition 15.2 implies that

$$\sqrt{T}\,\mathrm{vech}(\hat\Sigma_u(n_T) - \Sigma_u)$$

has precisely the same asymptotic distribution as the one in (15.2.10). Obviously,
this distribution does not depend on the VAR structure of $y_t$ or the VAR
coefficients. In addition, the estimator $\hat\Sigma_u(n_T)$ is asymptotically independent
of $\hat\pi(n_T)$. In the following, the consequences of these results for prediction
and impulse response analysis will be discussed.
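The estimator of Proposition 15.2 and the vech operator appearing in (15.2.10) can be sketched in a few lines. This is a minimal illustration only: the function names `residual_covariance` and `vech` are hypothetical, and the residuals are assumed to be supplied as a list of $K$-dimensional lists.

```python
def residual_covariance(residuals):
    """Return (1/T) * sum_t u_t u_t' for a list of K-dimensional residuals."""
    T = len(residuals)
    K = len(residuals[0])
    sigma = [[0.0] * K for _ in range(K)]
    for u in residuals:
        for i in range(K):
            for j in range(K):
                sigma[i][j] += u[i] * u[j] / T
    return sigma


def vech(matrix):
    """Stack the elements on and below the main diagonal, column by column."""
    K = len(matrix)
    return [matrix[i][j] for j in range(K) for i in range(j, K)]


# Hypothetical residuals, chosen only to make the arithmetic easy to follow.
sigma_hat = residual_covariance([[1.0, 2.0], [3.0, 4.0]])
print(sigma_hat)        # [[5.0, 7.0], [7.0, 10.0]]
print(vech(sigma_hat))  # [5.0, 7.0, 10.0]
```

For a $(K \times K)$ matrix, `vech` returns $\tfrac{1}{2}K(K+1)$ elements, which matches the column dimension of the duplication matrix $D_K$.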


15.3  Forecasting 

15.3.1  Theoretical  Results 

Suppose the VAR($n_T$) model estimated in the previous section is used for
forecasting. In that case, the usual $h$-step forecast at origin $T$, $\hat y_T(h)$, can be
computed recursively for $h = 1, 2, \dots$, using

$$\hat y_T(h) = \sum_{i=1}^{n_T}\hat\Pi_i(n_T)\,\hat y_T(h - i), \tag{15.3.1}$$

where $\hat y_T(j) := y_{T+j}$ for $j \le 0$ (see Section 3.5). We use the notation

$$\mathbf y_T(h) := \begin{bmatrix} y_T(1) \\ \vdots \\ y_T(h)\end{bmatrix},\quad
\hat{\mathbf y}_T(h) := \begin{bmatrix} \hat y_T(1) \\ \vdots \\ \hat y_T(h)\end{bmatrix},\quad
\mathbf y_{T,h} := \begin{bmatrix} y_{T+1} \\ \vdots \\ y_{T+h}\end{bmatrix}$$

and

$$\Sigma_{\mathbf y}(h) := E\big\{[\mathbf y_{T,h} - \mathbf y_T(h)][\mathbf y_{T,h} - \mathbf y_T(h)]'\big\},$$

where $y_T(j)$, $j = 1, \dots, h$, is the optimal $j$-step forecast at origin $T$ based on
the infinite past, that is,

$$y_T(j) = \sum_{i=1}^{\infty}\Pi_i\,y_T(j - i)$$

with $y_T(i) := y_{T+i}$ for $i \le 0$ (see Section 11.5). The following result is also
essentially due to Lewis & Reinsel (1985) (see also Lütkepohl (1987, Section
3.3, Proposition 3.2)).
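The recursion (15.3.1) is easy to state in code. The sketch below is illustrative only: it assumes the estimated coefficient matrices are given as nested lists, that at least $n_T$ observations are available, and the names `mat_vec` and `recursive_forecasts` as well as the numerical values are hypothetical.

```python
def mat_vec(A, x):
    """Multiply the matrix A (list of rows) by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def recursive_forecasts(coeffs, history, horizon):
    """Forecasts for h = 1, ..., horizon from a fitted VAR(n).

    coeffs  = [Pi_1, ..., Pi_n] (each a K x K list of rows)
    history = observations ordered in time, most recent last (len >= n)
    """
    n = len(coeffs)
    path = [list(y) for y in history[-n:]]   # y_{T-n+1}, ..., y_T
    K = len(path[-1])
    forecasts = []
    for _ in range(horizon):
        y_next = [0.0] * K
        for i, Pi in enumerate(coeffs, start=1):
            # path[-i] is y_T(h - i); for h - i <= 0 it is an observed value.
            term = mat_vec(Pi, path[-i])
            y_next = [a + b for a, b in zip(y_next, term)]
        path.append(y_next)
        forecasts.append(y_next)
    return forecasts


# Hypothetical bivariate VAR(1) coefficients and last observation.
Pi_1 = [[0.5, 0.0], [0.1, 0.2]]
fc = recursive_forecasts([Pi_1], [[1.0, 2.0]], horizon=2)
print(fc[0])  # 1-step forecast, approximately [0.5, 0.5]
print(fc[1])  # 2-step forecast, approximately [0.25, 0.15]
```

Forecasts for $h > 1$ reuse earlier forecasts in place of unobserved values, which is exactly the convention $\hat y_T(j) := y_{T+j}$ for $j \le 0$.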

Proposition  15.3  ( Asymptotic  Distributions  of  Estimated  Forecasts) 

Under the conditions of Proposition 15.1, if $y_t$ is a Gaussian process and if
independent processes with identical stochastic structures are used for estimation
and forecasting, respectively, then

$$\sqrt{T/n_T}\,[\hat{\mathbf y}_T(h) - \mathbf y_T(h)] \xrightarrow{d} \mathcal N\big(0,\ K\Sigma_{\mathbf y}(h)\big).$$ ■

Remark 1 The proposition implies that for large samples the forecast vector
$\hat{\mathbf y}_T(h)$ has approximate MSE matrix

$$\Sigma_{\hat{\mathbf y}}(h) = \Big(1 + \frac{Kn_T}{T}\Big)\Sigma_{\mathbf y}(h). \tag{15.3.2}$$

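This result can be seen by noting that

$$E\{[\mathbf y_{T,h} - \hat{\mathbf y}_T(h)][\mathbf y_{T,h} - \hat{\mathbf y}_T(h)]'\}
= E\{[\mathbf y_{T,h} - \mathbf y_T(h)][\mathbf y_{T,h} - \mathbf y_T(h)]'\}
+ E\{[\mathbf y_T(h) - \hat{\mathbf y}_T(h)][\mathbf y_T(h) - \hat{\mathbf y}_T(h)]'\}$$

and approximating the last term via the asymptotic result of Proposition 15.3. ■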

Remark 2 An approximation for the MSE matrix of an $h$-step forecast $\hat y_T(h)$
follows directly from (15.3.2),

$$\Sigma_{\hat y}(h) = \Big(1 + \frac{Kn_T}{T}\Big)\Sigma_y(h),\quad h = 1, 2, \dots. \tag{15.3.3}$$

In Section 3.5.1, we have obtained an approximate MSE matrix

$$\Sigma_{\hat y}(h) = \Sigma_y(h) + \frac{1}{T}\Omega(h) \tag{15.3.4}$$

for an $h$-step forecast based on an estimated VAR process with known finite
order. If in Chapter 3 the process mean is known to be zero and is not estimated,
it can be shown that $\Omega(h)$ approaches zero as $h \to \infty$. In other words,
the MSE part due to estimation variability goes to zero as the forecast horizon
increases. The same does not hold in the present case. In fact, the $\Sigma_{\hat y}(h)$'s are
monotonically nondecreasing for growing $h$, that is,

$$\Sigma_{\hat y}(h) \ge \Sigma_{\hat y}(i),\quad\text{for } h \ge i.$$

The explanation for this result is that, under the present assumptions, increasingly
many parameters are estimated with growing sample size. For a zero
mean VAR process with known finite order, the optimal forecast approaches
the process mean of zero when the forecast horizon gets large and, thus, the
estimated VAR parameters do not contribute to the forecast uncertainty for
long-run forecasts. The same is not true under the present conditions, where
the VAR order goes to infinity. ■

Remark 3 We have also seen in Section 3.5.2 that $\Omega(1) = (Kp + 1)\Sigma_u$ for
a $K$-dimensional VAR($p$) process with estimated intercept term. It is easy to
see that, if the process mean is known to be zero and the mean term is not
estimated, $\Omega(1) = Kp\Sigma_u$. Hence, in that case, the approximation (15.3.4) gives

$$\Sigma_{\hat y}(1) = \Sigma_y(1) + \frac{1}{T}\Omega(1) = \Sigma_u + \frac{Kp}{T}\Sigma_u = \Big(1 + \frac{Kp}{T}\Big)\Sigma_y(1),$$

which coincides with (15.3.3) if $n_T = p$. In other words, for 1-step ahead forecasts,
the two MSE approximations are identical if the same VAR orders are used in both
approaches. It is easy to see that the same does not hold in general for predictions
more than 1 step ahead (see Problem 15.2). ■
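The size of the correction in (15.3.3) and the 1-step equality of Remark 3 can be checked with a few lines of arithmetic. This is a sketch only; the function names are hypothetical, and the values $K = 3$, $T = 73$ and $n_T = p = 2$ are those of the example in Section 15.3.2.

```python
def infinite_order_mse_factor(K, n_T, T):
    """Scalar factor (1 + K * n_T / T) multiplying Sigma_y(h) in (15.3.3)."""
    return 1.0 + K * n_T / T


def one_step_finite_order_factor(K, p, T):
    """Factor on Sigma_u in (15.3.4) for h = 1 when Omega(1) = K * p * Sigma_u."""
    return 1.0 + K * p / T


K, T = 3, 73   # dimension and sample size of the example in Section 15.3.2
n_T = p = 2

factor = infinite_order_mse_factor(K, n_T, T)
print(round(factor, 4))         # 1.0822: MSE inflation at every horizon
print(round(factor ** 0.5, 4))  # 1.0403: implied widening of interval forecasts

# For h = 1 the two approximations agree when n_T = p (no estimated mean term).
print(factor == one_step_finite_order_factor(K, p, T))  # True
```

The square root of the factor is the relevant quantity for interval forecasts, because interval widths scale with the forecast standard error rather than with the MSE.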

Remark 4 Because forecasts can be obtained from finite order approximations
to infinite order VAR processes, we may also base the prediction tests for
structural change considered in Sections 4.6.2 and 13.5.3 on such approximations.
Of course, in that case the MSE approximation implied by Proposition
15.3 should be used in setting up the test statistics. For instance, a test statistic
based on $h$-step forecasts would be

$$\tau_h = (y_{T+h} - \hat y_T(h))'\,\hat\Sigma_{\hat y}(h)^{-1}(y_{T+h} - \hat y_T(h)),$$

where $\hat\Sigma_{\hat y}(h)$ is an estimator of $\Sigma_{\hat y}(h)$. ■

Remark 5 If $y_t$ is a process with nonzero mean vector $\mu$, then the sample
mean may be subtracted from the original data and the previous analysis may
be performed with the mean-adjusted data. If the sample mean is added to
the forecasts, an extra term should be added to the approximate MSE matrix.
A term similar to that resulting from an estimated mean term in a finite order
VAR setting with known order may be added (see Problem 3.9, Chapter 3). ■


15.3.2  An  Example 

To illustrate the effects of approximating a potentially infinite order VAR
process by a finite order model, we use again the West German
investment/income/consumption data from File E1. The variables $y_1$, $y_2$, and $y_3$
are defined as in Chapter 3, Section 3.2.3, and we use the same sample period
1960-1978 and a VAR order $n_T = 2$. That is, we assume that the VAR
order depends on the sample size in such a way that $n_T = 2$ for $T = 73$.
Note that the condition (15.2.5) for the VAR order is an asymptotic condition
that leaves open the actual choice in finite samples. Therefore, we choose
the VAR order that was suggested by the AIC criterion in Chapter 4 and,
thus, we use the same VAR order as in Chapter 3. As a consequence, the
point forecasts obtained under our present assumptions are the same as those
one gets from a mean-adjusted model under the conditions of Chapter 3. The
interval forecasts obtained under the different sets of assumptions are different for
$h > 1$, however, because the approximate MSE matrices are different. We
have estimated $\Sigma_{\hat y}(h)$ by

$$\hat\Sigma_{\hat y}(h) = \Big(1 + \frac{Kn_T}{T}\Big)\sum_{i=0}^{h-1}\hat\Phi_i\hat\Sigma_u\hat\Phi_i' + \frac{1}{T}\hat G_y(h), \tag{15.3.5}$$

where the $\hat\Phi_i$'s and $\hat\Sigma_u$ are obtained from the VAR(2) estimates, as in Section
3.5.3, and $\hat G_y(h)/T$ is a term that takes account of the fact that the mean
term is estimated in addition to the VAR coefficients. It is the same term that
is used if a VAR(2) process with true order $p = 2$ is assumed and the model
is estimated in mean-adjusted form (see Problem 3.9).


Table 15.1. Interval forecasts from a VAR(2) model for the investment/income/
consumption example series based on different asymptotic theories

                                        95% interval forecasts
              forecast   point      based on known      based on infinite
variable      horizon    forecast   order assumption    order assumption
investment       1        -.010     [-.105, .085]       [-.105, .085]
                 2         .012     [-.087, .110]       [  …  , .112]
                 3         .022     [-.075, .119]       [  …  , .122]
                 4         .013     [-.084, .111]       [  …  , .114]
income           1         .020     [-.004, .044]       [-.004, .044]
                 2         .020     [-.004, .045]       [-.005, .045]
                 3         .017     [-.007, .042]       [  …  , .042]
                 4         .021     [-.004, .045]       [-.005, .047]
consumption      1         .022     [ .002, .041]       [ .002, .041]
                 2         .015     [-.005, .035]       [  …  , .035]
                 3         .020     [-.002, .042]       [-.002, .042]
                 4         .019     [-.003, .041]       [-.003, .041]

(…: lower bound not legible.)

We have used the approximate forecast MSEs from (15.3.5) to set up forecast
intervals under Gaussian assumptions and give them in Table 15.1. For
comparison purposes we also give forecast intervals obtained from a VAR(2)
process in mean-adjusted form based on the asymptotic theory of Chapter
3, assuming that the true order is $p = 2$. As we know from Remark 3 in
Section 15.3.1, the 1-step forecast MSEs are the same under the two competing
assumptions. For larger forecast horizons, most of the intervals based
on the infinite order assumption become slightly wider than those based on
the known finite order assumption, as expected on the basis of Remark 2 in
Section 15.3.1. For our sample size, the differences are quite small, though.
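Under Gaussian assumptions, each interval in Table 15.1 has the form point forecast plus or minus 1.96 forecast standard errors. The sketch below shows the arithmetic; the function name `forecast_interval` is hypothetical, and the standard error 0.04847 is an assumption back-computed from the reported 1-step interval for investment, not a value given in the text.

```python
import math


def forecast_interval(point, mse, z=1.96):
    """Return (lower, upper) of a Gaussian interval forecast with the given MSE."""
    half_width = z * math.sqrt(mse)
    return (point - half_width, point + half_width)


# Hypothetical 1-step forecast standard error for investment, back-computed
# from the interval [-.105, .085] around the point forecast -.010.
lower, upper = forecast_interval(-0.010, 0.04847 ** 2)
print(round(lower, 3), round(upper, 3))   # -0.105 0.085
```

To compare the two asymptotic theories, only the `mse` argument changes: the diagonal element of (15.3.5) under the infinite order assumption, or the corresponding Chapter 3 approximation under the known order assumption.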

Which  of  the  two  sets  of  forecast  intervals  should  we  use  in  practice?  This 
question  is  difficult  to  answer.  Assuming  a  known  finite  VAR  order  is,  of 
course,  more  restrictive  and  less  realistic  than  the  assumption  of  an  unknown 
and  possibly  infinite  order.  The  additional  uncertainty  introduced  by  the  lat¬ 
ter  assumption  is  reflected  in  the  wider  forecast  intervals.  It  may  be  worth 
noting,  however,  that  such  a  result  is  not  necessarily  obtained  in  all  practical 
situations.  In  other  words,  there  may  be  time  series  and  generation  processes 
for which the infinite order assumption actually leads to smaller forecast
intervals than the assumption of a known finite VAR order (see Problem 15.2).
Under both sets of assumptions, the MSE approximations are derived from
asymptotic  theory  and  little  is  known  about  the  small  sample  quality  of  these 
approximations.  Both  approaches  are  based  on  a  set  of  assumptions  that  may 
not  hold  in  practice.  Notably  the  stationarity  and  normality  assumptions  may 
be  doubtful  in  many  practical  situations.  Given  all  these  reservations,  there 
is  still  one  argument  in  favor  of  the  present  approach,  assuming  a  potentially 
infinite  VAR  order.  For  h  >  1,  the  MSE  approximation  in  (15.3.3)  is  generally 
simpler  to  compute  than  the  one  obtained  in  Chapter  3. 


15.4  Impulse  Response  Analysis  and  Forecast  Error 
Variance  Decompositions 

15.4.1  Asymptotic  Theory 

For a researcher who does not know the true structure of the data generating
process, it is possible to base an impulse response analysis or forecast error
variance decomposition on an approximating finite order VAR process. Given
the results of Section 15.2, we can now study the consequences of such an
approach. As in Sections 2.3.2 and 2.3.3, the quantities of interest here are
the forecast error impulse responses,

$$\Phi_i = \sum_{j=1}^{i}\Phi_{i-j}\Pi_j,\quad i = 1, 2, \dots,\quad \Phi_0 = I_K,$$

the accumulated forecast error impulse responses,

$$\Psi_m = \sum_{i=0}^{m}\Phi_i,\quad m = 0, 1, \dots,$$

the responses to orthogonalized impulses,

$$\Theta_i = \Phi_iP,\quad i = 0, 1, \dots,$$

where $P$ is the lower triangular matrix obtained by a Choleski decomposition
of $\Sigma_u$, the accumulated orthogonalized impulse responses,

$$\Xi_m = \sum_{i=0}^{m}\Theta_i,\quad m = 0, 1, \dots,$$

and the forecast error variance components,

$$\omega_{jk,h} = \sum_{i=0}^{h-1}(e_j'\Theta_ie_k)^2/\mathrm{MSE}_j(h),\quad h = 1, 2, \dots,$$

where $e_k$ is the $k$-th column of $I_K$ and

$$\mathrm{MSE}_j(h) = \sum_{i=0}^{h-1}e_j'\Phi_i\Sigma_u\Phi_i'e_j$$

is the $j$-th diagonal element of the MSE matrix, $\Sigma_y(h)$, of an $h$-step forecast.
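The orthogonalized responses and the forecast error variance components defined above can be computed directly from the $\Phi_i$ and $\Sigma_u$. The following sketch is illustrative only: the function names and the numerical values are hypothetical, and a hand-written Choleski factorization is used so that the block needs nothing beyond the standard library.

```python
import math


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def cholesky_lower(S):
    """Lower triangular P with positive diagonal such that P P' = S."""
    K = len(S)
    P = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1):
            s = sum(P[i][k] * P[j][k] for k in range(j))
            if i == j:
                P[i][j] = math.sqrt(S[i][i] - s)
            else:
                P[i][j] = (S[i][j] - s) / P[j][j]
    return P


def fevd(phis, sigma_u, h):
    """omega[j][k]: share of the h-step forecast error variance of variable j
    accounted for by orthogonalized innovations in variable k.
    phis = [Phi_0, Phi_1, ...] must contain at least h matrices."""
    P = cholesky_lower(sigma_u)
    thetas = [matmul(Phi, P) for Phi in phis[:h]]   # Theta_i = Phi_i P
    K = len(sigma_u)
    omega = []
    for j in range(K):
        contrib = [sum(Theta[j][k] ** 2 for Theta in thetas) for k in range(K)]
        mse_j = sum(contrib)   # equals sum_i e_j' Phi_i Sigma_u Phi_i' e_j
        omega.append([c / mse_j for c in contrib])
    return omega


# Hypothetical bivariate example.
phis = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.1, 0.2]]]
sigma_u = [[1.0, 0.5], [0.5, 2.0]]
for row in fevd(phis, sigma_u, 2):
    print([round(x, 3) for x in row])
```

Each row of the result sums to one, because $\Theta_i\Theta_i' = \Phi_i\Sigma_u\Phi_i'$ implies that the numerators add up to $\mathrm{MSE}_j(h)$.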

Estimators of these quantities are obtained from the $\hat\Pi_i(n_T)$ and $\hat\Sigma_u(n_T)$ in
the obvious way. For instance, estimators for the $\Phi_i$'s are obtained recursively
as

$$\hat\Phi_i(n_T) = \sum_{j=1}^{i}\hat\Phi_{i-j}(n_T)\hat\Pi_j(n_T),\quad i = 1, 2, \dots,$$

with $\hat\Phi_0(n_T) = I_K$, and

$$\hat\Theta_i(n_T) = \hat\Phi_i(n_T)\hat P(n_T),\quad i = 0, 1, \dots,$$

are estimators of the $\Theta_i$'s. Here $\hat P(n_T)$ is the unique lower triangular matrix
with positive main diagonal for which

$$\hat P(n_T)\hat P(n_T)' = \hat\Sigma_u(n_T).$$

The asymptotic distributions of all the estimators are given in the next proposition.
Proofs, based on Propositions 15.1 and 15.2, are given by Lütkepohl
(1988a) and Lütkepohl & Poskitt (1991).
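The recursion for the estimated impulse responses can be sketched as follows. The code assumes that coefficient matrices beyond the fitted order are treated as zero, which is what fitting a VAR($n_T$) implies; the function names and the coefficient values are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matadd(A, B):
    """Elementwise sum of two matrices of equal dimension."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def identity(K):
    return [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]


def ma_coefficients(pis, horizon):
    """Return [Phi_0, ..., Phi_horizon] with Phi_0 = I_K and
    Phi_i = sum_{j=1}^{i} Phi_{i-j} Pi_j, where Pi_j = 0 for j > len(pis)."""
    K = len(pis[0])
    phis = [identity(K)]
    for i in range(1, horizon + 1):
        acc = [[0.0] * K for _ in range(K)]
        for j in range(1, min(i, len(pis)) + 1):
            acc = matadd(acc, matmul(phis[i - j], pis[j - 1]))
        phis.append(acc)
    return phis


def accumulate(mats):
    """Return the running sums Psi_m = Phi_0 + ... + Phi_m."""
    sums = [mats[0]]
    for M in mats[1:]:
        sums.append(matadd(sums[-1], M))
    return sums


# Hypothetical VAR(1) coefficient matrix: Phi_i is then the i-th power of Pi_1.
Pi_1 = [[0.5, 0.0], [0.1, 0.2]]
phis = ma_coefficients([Pi_1], horizon=3)
psis = accumulate(phis)
print(phis[2])   # approximately [[0.25, 0.0], [0.07, 0.04]]
print(psis[2])   # approximately [[1.75, 0.0], [0.17, 1.24]]
```

Multiplying each $\hat\Phi_i(n_T)$ by a Choleski factor of $\hat\Sigma_u(n_T)$ then gives the orthogonalized responses $\hat\Theta_i(n_T)$.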

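Proposition 15.4 (Asymptotic Distributions of Impulse Responses)

Under the conditions of Proposition 15.2, the impulse responses and forecast
error variance components have the following asymptotic normal distributions:

$$\sqrt{T}\,\mathrm{vec}(\hat\Phi_i(n_T) - \Phi_i) \xrightarrow{d} \mathcal N\Big(0,\ \Sigma_u^{-1}\otimes\sum_{j=0}^{i-1}\Phi_j\Sigma_u\Phi_j'\Big),\quad i = 1, 2, \dots; \tag{15.4.1}$$

$$\sqrt{T}\,\mathrm{vec}(\hat\Psi_m(n_T) - \Psi_m) \xrightarrow{d} \mathcal N\Big(0,\ \Sigma_u^{-1}\otimes\sum_{k=1}^{m}\sum_{l=1}^{m}\sum_{j=0}^{l-1}\Phi_j\Sigma_u\Phi_{k-l+j}'\Big), \tag{15.4.2}$$

$m = 1, 2, \dots$, with $\Phi_j := 0$ for $j < 0$;

$$\sqrt{T}\,\mathrm{vec}(\hat\Theta_i(n_T) - \Theta_i) \xrightarrow{d} \mathcal N(0, \Omega_\theta(i)),\quad i = 0, 1, \dots, \tag{15.4.3}$$

where

$$\Omega_\theta(i) = \Big(I_K\otimes\sum_{j=0}^{i-1}\Phi_j\Sigma_u\Phi_j'\Big) + (I_K\otimes\Phi_i)H\Sigma_\sigma H'(I_K\otimes\Phi_i'),$$

$$H = L_K'[L_K(I_{K^2} + \mathbf K_{KK})(P\otimes I_K)L_K']^{-1},$$

$L_K$ is the $(\tfrac{1}{2}K(K+1) \times K^2)$ elimination matrix,
$\mathbf K_{KK}$ is the $(K^2 \times K^2)$ commutation matrix,
and $\Sigma_\sigma$ is the asymptotic covariance matrix of $\sqrt{T}\,\mathrm{vech}(T^{-1}\sum_{t=1}^{T}u_tu_t' - \Sigma_u)$;

$$\sqrt{T}\,\mathrm{vec}(\hat\Xi_m(n_T) - \Xi_m) \xrightarrow{d} \mathcal N(0, \Omega_\Xi(m)),\quad m = 1, 2, \dots, \tag{15.4.4}$$

where

$$\Omega_\Xi(m) = \sum_{k=0}^{m}\sum_{l=0}^{m}\Big[I_K\otimes\sum_{j=0}^{l-1}\Phi_j\Sigma_u\Phi_{k-l+j}' + (I_K\otimes\Phi_k)H\Sigma_\sigma H'(I_K\otimes\Phi_l')\Big]$$

with $\Phi_j := 0$ for $j < 0$;

$$\sqrt{T}(\hat\omega_{jk,h}(n_T) - \omega_{jk,h}) \xrightarrow{d} \mathcal N(0, \sigma^2_{jk,h}),\quad h = 1, 2, \dots,\quad j, k = 1, \dots, K, \tag{15.4.5}$$

where

$$\sigma^2_{jk,h} = \sum_{l=0}^{h-1}\sum_{m=0}^{h-1}g_{jk,h}(l)\Big[I_K\otimes\sum_{i=0}^{m-1}\Phi_i\Sigma_u\Phi_{l-m+i}' + (I_K\otimes\Phi_m)H\Sigma_\sigma H'(I_K\otimes\Phi_l')\Big]g_{jk,h}(m)'$$

with

$$g_{jk,h}(m) = 2\Big[(e_k'\otimes e_j')(e_j'\Theta_me_k)\mathrm{MSE}_j(h) - (e_j'\Theta_m\otimes e_j')\sum_{i=0}^{h-1}(e_j'\Theta_ie_k)^2\Big]\Big/\mathrm{MSE}_j(h)^2.$$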


Remark 1 In the proposition, it is ignored that $\sigma^2_{jk,h}$ may be zero, in which
case the asymptotic normal distribution is degenerate. In particular, $\sigma^2_{jk,h} = 0$
if $\omega_{jk,h} = 0$. This result is easily seen by noting that $\omega_{jk,h}$ is zero if and only
if $\theta_{jk,0} = \dots = \theta_{jk,h-1} = 0$, where $\theta_{jk,m}$ is the $jk$-th element of $\Theta_m$. Thus,
the asymptotic distribution in (15.4.5) is not immediately useful for testing

$$H_0: \omega_{jk,h} = 0\quad\text{against}\quad H_1: \omega_{jk,h} \ne 0, \tag{15.4.6}$$

which is a set of hypotheses of particular interest in practice. The significance
of $\omega_{jk,h}$ may be checked, however, by testing $\theta_{jk,0} = \dots = \theta_{jk,h-1} = 0$. Using a
minor generalization of (15.4.3), this hypothesis can be tested (see Lütkepohl
& Poskitt (1991)). ■

Remark 2 In sharp contrast to the case where the VAR order is assumed
to be known and finite (see Proposition 3.6), the asymptotic variances of all
impulse responses are nonzero in the present case. Another difference between
the finite and infinite order VAR cases is that in the former the asymptotic
standard errors of the $\hat\Phi_i$ and $\hat\Theta_i$ go to zero as $i$ increases, while the covariance
matrix in (15.4.1) is a nondecreasing function of $i$ and the covariance matrix
in (15.4.3) is bounded away from zero, for $i > 0$. ■

Remark 3 For $i = 1$, the asymptotic covariance matrix of $\hat\Phi_1(n_T)$ in (15.4.1)
is $\Sigma_u^{-1}\otimes\Sigma_u$. It can be shown that the same asymptotic covariance matrix
is obtained for $\hat\Phi_1$ from Proposition 3.6, if a VAR($n$) process is fitted with
$n$ greater than the true order $p$ (see Lütkepohl (1988a)). A similar result is
obtained for $\hat\Theta_i(n_T)$ and $\hat\Theta_i$ for $i = 0, 1$ (see Problem 15.4). ■

Remark 4 The results in Proposition 15.4 can also be used to construct tests
for zero impulse responses. This case was considered by Lütkepohl (1996b). ■

Remark  5  Although  forecast  error  and  orthogonalized  impulse  responses  are 
considered  only  in  Proposition  15.4,  similar  results  can  also  be  obtained  for 
the  structural  impulse  responses  discussed  in  Chapter  9.  ■ 


15.4.2  An  Example 

To  illustrate  the  consequences  of  the  finite  and  infinite  VAR  order  assump¬ 
tions,  we  use  again  the  VAR(2)  model  for  the  investment/income/consumption 
data.  Of  course,  the  same  estimated  impulse  responses  are  obtained  as  in  Sec¬ 
tion  3.7.  (The  intercept  form  of  the  model  is  used  now.)  The  standard  errors 
are  different,  however.  In  Figures  15.1  and  15.2,  consumption  responses  to  in¬ 
come  impulses  are  depicted  and  the  two-standard  error  bounds  obtained  from 
both  sets  of  assumptions  are  shown.  In  both  figures,  the  two-standard  error 
bounds  based  on  Proposition  3.6  decline  almost  to  zero  for  longer  lags  while 
the  two-standard  error  bounds  from  Proposition  15.4  are  seen  to  grow  with 
the  time  lag.  This  behavior  reflects  the  additional  estimation  uncertainty  that 
results  from  assuming  that  the  VAR  order  goes  to  infinity  with  the  sample 
size.  Thereby  more  and  more  parameters  are  estimated  as  the  sample  size  gets 
large. 

In Table 15.2, forecast error variance decompositions of the system are
shown. Again most standard errors based on the infinite VAR assumption are
slightly larger than those from Chapter 3, which are also given in the table.
Although it is tempting to use the estimated standard errors in checking the
significance of individual forecast error variance components, we know from
Remark 1 of Section 15.4.1 that they are not useful for that purpose because
the asymptotic standard errors from Proposition 15.4 corresponding to zero
forecast error variance components are zero.

Fig. 15.1. Estimated responses of consumption to a forecast error impulse in income
with two-standard error bounds based on finite and infinite VAR order assumptions.

Fig. 15.2. Estimated responses of consumption to an orthogonalized impulse in
income with two-standard error bounds based on finite and infinite VAR order
assumptions.

Table 15.2. Forecast error variance decompositions of the investment/income/
consumption system with standard errors from two different asymptotic theories

                          proportions of forecast error variance, h periods ahead,
forecast      forecast    accounted for by innovations in^a
error in      horizon     investment           income               consumption
              h           $\omega_{j1,h}$      $\omega_{j2,h}$      $\omega_{j3,h}$
investment    1           1.000(.000)[.000]    .000(.000)[.000]     .000(.000)[.000]
(j = 1)       2           .960(.042)[.044]     .018(.030)[.030]     .023(.031)[.033]
              3           .946(.042)[.045]     .028(.033)[.033]     .026(.029)[.032]
              4           .941(.045)[.048]     .029(.031)[.032]     .030(.032)[.036]
              8           .938(.048)[.050]     .031(.032)[.034]     .032(.035)[.039]
income        1           .018(.031)[.031]     .983(.031)[.031]     .000(.000)[.000]
(j = 2)       2           .060(.054)[.053]     .908(.063)[.064]     .032(.037)[.040]
              3           .070(.057)[.058]     .896(.066)[.068]     .035(.039)[.041]
              4           .068(.056)[.057]     .892(.067)[.069]     .039(.041)[.045]
              8           .069(.057)[.058]     .891(.068)[.070]     .040(.041)[.045]
consumption   1           .080(.061)[.061]     .273(.086)[.086]     .647(.090)[.090]
(j = 3)       2           .077(.059)[.059]     .274(.082)[.082]     .649(.088)[.088]
              3           .130(.080)[.080]     .334(.089)[.091]     .537(.091)[.092]
              4           .129(.079)[.079]     .335(.088)[.090]     .536(.089)[.091]
              8           .129(.080)[.081]     .340(.089)[.092]     .532(.091)[.093]

^a Estimated standard error based on a finite known VAR order assumption in
parentheses and estimated standard error based on an infinite VAR order assumption
in brackets.


15.5  Cointegrated  Infinite  Order  VARs 

In Chapter 6, it was discussed that assuming a fixed finite starting date is
advantageous if integrated variables are considered. Therefore, some modifications
will be necessary in defining infinite order processes for integrated
variables. Details of the model setup will be given in Section 15.5.1. The properties
of estimators of the parameters of such models are considered in Section
15.5.2, and testing for the cointegrating rank will be discussed in Section
15.5.3.

15.5.1 The Model Setup

The general framework presented in the following is that of Saikkonen (1992)
and Saikkonen & Lütkepohl (1996). Given a $K$-dimensional system of time
series variables $y_t$ with cointegrating rank $r$, we assume that the variables are
arranged such that for $t = 1, 2, \dots$,

$$\begin{aligned} y_t^{(1)} &= -\beta'_{(K-r)}y_t^{(2)} + z_t^{(1)}, \\ \Delta y_t^{(2)} &= z_t^{(2)}, \end{aligned} \tag{15.5.1}$$

where $y_t^{(1)}$ and $y_t^{(2)}$ are $(r \times 1)$ and $((K - r) \times 1)$, respectively, such that

$$y_t = \begin{bmatrix} y_t^{(1)} \\ y_t^{(2)} \end{bmatrix},$$

as in the triangular representation discussed in Section 6.3 (see (6.3.10)).
Hence, $\beta_{(K-r)}$ is $((K - r) \times r)$ such that

$$\beta = \begin{bmatrix} I_r \\ \beta_{(K-r)} \end{bmatrix}$$

is the cointegration matrix and

$$z_t = \begin{bmatrix} z_t^{(1)} \\ z_t^{(2)} \end{bmatrix}$$

is a strictly stationary process with $E(z_t) = 0$ and positive definite covariance
matrix $\Sigma_z = E(z_tz_t')$. As a further technical condition which is needed in
some proofs, we also assume that $z_t$ has a continuous spectral density matrix
which is positive definite at zero frequency. For a discussion of spectral density
matrices of vector processes see, e.g., Fuller (1976, Section 4.4). The initial
vector $y_0$ is assumed to be such that the process $\Delta y_t$ is stationary.
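A small simulation makes the triangular system (15.5.1) concrete. The sketch below takes the simplest possible case, $K = 2$ and $r = 1$ with independent standard normal $z_t$; this choice, the value of the cointegration parameter and the function name are assumptions made purely for illustration, since the text allows a general stationary $z_t$.

```python
import random


def simulate_triangular(T, beta_kr, seed=0):
    """Simulate y_t = (y1_t, y2_t) from the triangular system with K = 2, r = 1.

    y2_t is a random walk driven by z2_t, and
    y1_t = -beta_kr * y2_t + z1_t, so that y1_t + beta_kr * y2_t = z1_t.
    """
    rng = random.Random(seed)
    y2 = 0.0
    path, z1_values = [], []
    for _ in range(T):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        y2 += z2                      # Delta y2_t = z2_t
        y1 = -beta_kr * y2 + z1       # first equation of the triangular system
        path.append((y1, y2))
        z1_values.append(z1)
    return path, z1_values


beta_kr = 0.3   # hypothetical value of the scalar cointegration parameter
path, z1_values = simulate_triangular(200, beta_kr)

# The cointegration relation beta' y_t = y1_t + beta_kr * y2_t reproduces the
# stationary component z1_t, although y1_t and y2_t themselves are integrated.
residual = [y1 + beta_kr * y2 - z1 for (y1, y2), z1 in zip(path, z1_values)]
print(max(abs(x) for x in residual) < 1e-9)   # True
```

With $\beta' = [1 : \beta_{(K-r)}']$, the printed check is simply the statement that $\beta'y_t = z_t^{(1)}$ holds for every $t$.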

In matrix form, the process $y_t$ may be written as

$$\begin{bmatrix} I_r & \beta'_{(K-r)} \\ 0 & I_{K-r} \end{bmatrix} y_t = \begin{bmatrix} 0 & 0 \\ 0 & I_{K-r} \end{bmatrix} y_{t-1} + z_t. \tag{15.5.2}$$

Multiplying by the inverse of the left-hand matrix,

$$\begin{bmatrix} I_r & -\beta'_{(K-r)} \\ 0 & I_{K-r} \end{bmatrix},$$

subtracting $y_{t-1}$ on both sides of the equation and rearranging terms gives

$$\Delta y_t = -\begin{bmatrix} I_r & \beta'_{(K-r)} \\ 0 & 0 \end{bmatrix} y_{t-1} + v_t = -\begin{bmatrix} I_r \\ 0 \end{bmatrix}\beta'y_{t-1} + v_t, \tag{15.5.3}$$

where $\beta' := [I_r : \beta'_{(K-r)}]$ and

$$v_t := \begin{bmatrix} I_r & -\beta'_{(K-r)} \\ 0 & I_{K-r} \end{bmatrix} z_t$$

is a stationary process. It is assumed to have an infinite order VAR representation,

$$v_t = \sum_{j=1}^{\infty}G_jv_{t-j} + u_t,\quad t \in \mathbb Z, \tag{15.5.4}$$

where $u_t$ is again standard white noise. Notice that, due to the stationarity of
$v_t$, there is no problem in defining it for all integers $t$. Moreover, because the
process $v_t$ is stationary, it also has an MA representation for which we could
make similar assumptions as in (15.2.4). We do not need that representation
here, however, and therefore we formulate the required assumptions directly
for the VAR coefficients. In particular, the $G_j$'s are assumed to satisfy

$$\det\Big(I_K - \sum_{j=1}^{\infty}G_jz^j\Big) \ne 0\ \text{ for } |z| \le 1\quad\text{and}\quad \sum_{j=1}^{\infty}j\|G_j\| < \infty. \tag{15.5.5}$$

This condition imposes weak restrictions on the autocorrelation structure of
the process $v_t$ and is, for example, satisfied for VARMA processes. From the
previous assumptions, it follows that if the infinite order VAR is approximated
by a finite order process, the approximation error gets sufficiently small for
our purposes, if the order of the approximating process is chosen as in (15.2.5)
with $\sqrt{T}\sum_{j=n_T+1}^{\infty}\|G_j\| \to 0$ as $T \to \infty$.

Defining

$$G_j^{*} := -(G_{j+1} + \dots + G_n),\quad j = 0, 1, \dots, n - 1,$$

and

$$G_{n-1}^{*}(L) := \sum_{j=0}^{n-1}G_j^{*}L^j,$$

it follows that

$$G_n(L) := I_K - \sum_{j=1}^{n}G_jL^j = G_n(1) - G_{n-1}^{*}(L)(1 - L) \tag{15.5.6}$$

(see Problem 15.6). Multiplying (15.5.3) by $G_n(L)$ and rearranging terms gives
the VECM representation

$$\Delta y_t = \alpha\beta'y_{t-1} + \sum_{j=1}^{n}\Gamma_j\Delta y_{t-j} + e_t,\quad t = n + 1, n + 2, \dots, \tag{15.5.7}$$

where

$$\alpha := -G_n(1)\begin{bmatrix} I_r \\ 0 \end{bmatrix},\qquad e_t := u_t + \sum_{j=n+1}^{\infty}G_jv_{t-j}$$

and

$$\sum_{j=1}^{n}\Gamma_jL^j = \sum_{j=1}^{n}G_jL^j + G_{n-1}^{*}(L)\begin{bmatrix} \beta' \\ 0 \end{bmatrix}L = \sum_{j=1}^{n}\Big(G_j + G_{j-1}^{*}\begin{bmatrix} \beta' \\ 0 \end{bmatrix}\Big)L^j.$$

Hence,

$$\Gamma_j = G_j - (G_j + \dots + G_n)\begin{bmatrix} \beta' \\ 0 \end{bmatrix},\quad j = 1, \dots, n$$

(see Problem 15.7). Although this fact is not specifically indicated, the
coefficient matrices $\alpha$ and $\Gamma_j$, $j = 1, \dots, n$, depend on $n$. In particular,
$\Gamma_n = [0 : \Gamma_{n2}]$, where $\Gamma_{n2}$ is $(K \times (K - r))$. It can be shown that assumption
(15.5.5) implies that the $\Gamma_j$'s are absolutely summable, that is,
$\lim_{n\to\infty}\sum_{j=1}^{n}\|\Gamma_j\|$ exists, and the process $y_t$ is well-defined (Phillips & Solo
(1992, 2.1 Lemma)).

Rearranging terms, the VECM (15.5.7) can also be rewritten in levels VAR
form as

$$y_t = \sum_{j=1}^{n+1}\Pi_jy_{t-j} + e_t,\quad t = n + 1, n + 2, \dots, \tag{15.5.8}$$

where

$$\begin{aligned}
\Pi_1 &= I_K + \alpha\beta' + \Gamma_1 = I_K + G_1 - \begin{bmatrix} \beta' \\ 0 \end{bmatrix}, \\
\Pi_j &= \Gamma_j - \Gamma_{j-1} = G_j + [0 : G_{j-1,1}\beta'_{(K-r)} - G_{j-1,2}],\quad j = 2, \dots, n, \\
\Pi_{n+1} &= -\Gamma_n.
\end{aligned}$$

Here $G_{j-1,1}$ and $G_{j-1,2}$ are submatrices of $G_{j-1}$ consisting of the first $r$ and
last $K - r$ columns, respectively. Thus, although the $\Gamma_j$ depend on $n$, the
same is not true for the $\Pi_j$, except for $\Pi_{n+1}$. In the following subsection, the
asymptotic properties of the LS estimators of the VECM and the levels VAR
representations will be considered.
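The mapping from the VECM coefficients to the levels VAR coefficients, $\Pi_1 = I_K + \alpha\beta' + \Gamma_1$, $\Pi_j = \Gamma_j - \Gamma_{j-1}$ and $\Pi_{n+1} = -\Gamma_n$, can be sketched as follows. The function name and all numerical values are hypothetical, and at least one lagged difference ($n \ge 1$) is assumed.

```python
def vecm_to_levels(alpha, beta, gammas):
    """Return [Pi_1, ..., Pi_{n+1}] from alpha (K x r), beta (K x r) and
    gammas = [Gamma_1, ..., Gamma_n] (each K x K), all as lists of rows."""
    K, r, n = len(alpha), len(alpha[0]), len(gammas)
    # alpha * beta' is the (K x K) matrix multiplying y_{t-1} in the VECM.
    ab = [[sum(alpha[i][k] * beta[j][k] for k in range(r)) for j in range(K)]
          for i in range(K)]
    # Pi_1 = I_K + alpha beta' + Gamma_1
    levels = [[[(1.0 if i == j else 0.0) + ab[i][j] + gammas[0][i][j]
                for j in range(K)] for i in range(K)]]
    # Pi_j = Gamma_j - Gamma_{j-1}, j = 2, ..., n
    for j in range(1, n):
        levels.append([[gammas[j][a][b] - gammas[j - 1][a][b]
                        for b in range(K)] for a in range(K)])
    # Pi_{n+1} = -Gamma_n
    levels.append([[-g for g in row] for row in gammas[-1]])
    return levels


# Hypothetical example with K = 2, r = 1 and n = 2 lagged differences.
alpha = [[-0.5], [0.0]]
beta = [[1.0], [0.3]]
gammas = [[[0.2, 0.1], [0.0, 0.3]], [[0.0, 0.05], [0.0, -0.1]]]
pis = vecm_to_levels(alpha, beta, gammas)

# The levels coefficients sum to I_K + alpha beta', because the Gamma terms
# cancel in the telescoping sum.
total = [[sum(P[i][j] for P in pis) for j in range(2)] for i in range(2)]
print([[round(x, 10) for x in row] for row in total])  # [[0.5, -0.15], [0.0, 1.0]]
```

The telescoping check shows why the rank of $\Pi_1 + \dots + \Pi_{n+1} - I_K$ equals the cointegrating rank: the sum reduces to $\alpha\beta'$.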

15.5.2 Estimation

Suppose a levels VAR($n_T + 1$) model of the form (15.5.8) is estimated by
multivariate LS based on a sample of size $T$. Notice that the order of the model
now depends explicitly on the sample size $T$, as in the case where stationary
processes were approximated by finite order VARs, discussed earlier in this
chapter. We assume again that the VAR order goes to infinity with the sample
size although at a smaller rate than $T$. The following proposition, which is
similar to Theorem 2 of Saikkonen & Lütkepohl (1996), gives the details.
In stating the proposition, the LS estimators are denoted by $\hat\Pi_j$,
$\hat\Pi(n) := [\hat\Pi_1, \dots, \hat\Pi_n]$, and $\Pi(n) := [\Pi_1, \dots, \Pi_n]$, as before. Now we can
present the result.

Proposition 15.5 (Asymptotic Distribution of the LS Estimator of the VAR
Coefficients)

Suppose that finite order VAR($n_T + 1$) processes are fitted by multivariate LS
to a multiple time series generated by the process specified in Section 15.5.1
and assume that the order $n_T$ depends on the sample size $T$ such that

$$n_T \to \infty,\quad n_T^3/T \to 0,\quad\text{and}\quad \sqrt{T}\sum_{j=n_T+1}^{\infty}\|G_j\| \to 0\quad\text{as } T \to \infty. \tag{15.5.9}$$

Furthermore, let $c_1$, $c_2$ be fixed constants and $f(n)$ a sequence of nonzero
$((Kr + K^2n) \times 1)$ vectors such that

$$0 < c_1 \le f(n)'f(n) \le c_2 < \infty\quad\text{for } n = 1, 2, \dots.$$

Then

$$\frac{\sqrt{T - n_T}\,f(n_T)'[\hat\pi(n_T) - \pi(n_T)]}{[f(n_T)'(H_{n_T}'\Gamma^{-1}_{n_T,VECM}H_{n_T}\otimes\Sigma_u)f(n_T)]^{1/2}} \xrightarrow{d} \mathcal N(0, 1), \tag{15.5.10}$$

where $\pi(n) := \mathrm{vec}\,\Pi(n)$ and $\hat\pi(n) := \mathrm{vec}\,\hat\Pi(n)$, as in Section 15.2, $H_{n_T}$ is a
$((r + Kn_T) \times Kn_T)$ matrix defined such that

$$[\Pi_1 : \dots : \Pi_{n_T}] = [\alpha : \Gamma_1 : \dots : \Gamma_{n_T}]H_{n_T} + [I_K : 0 : \dots : 0]$$

and

$$\Gamma_{n_T,VECM} := E\left(\begin{bmatrix} u^{(1)}_{t-1} \\ \Delta y_{t-1} \\ \vdots \\ \Delta y_{t-n_T} \end{bmatrix}[u^{(1)\prime}_{t-1}, \Delta y_{t-1}', \dots, \Delta y_{t-n_T}']\right). \tag{15.5.11}$$

Here $u^{(1)}_{t-1}$ denotes the vector of the first $r$ components of $u_{t-1}$.

This proposition can be proven analogously to Theorem 2 in Saikkonen &
Lütkepohl (1996). Clearly, the proposition is similar to Proposition 15.1. Note,
however, that in the present proposition, only the first $n_T$ coefficient matrices
are considered, although a VAR($n_T + 1$) process is fitted to the data. Dropping
the last lag in deriving the asymptotic distribution of the estimators ensures
that standard asymptotic properties are obtained. This device was also used
in Section 7.6.3 in deriving a Wald test for Granger-causality in a finite order
cointegrated VAR context.

Consider now the VECM with $n_T$ lagged differences,

$$\Delta y_t = \Pi y_{t-1} + \sum_{j=1}^{n_T}\Gamma_j\Delta y_{t-j} + e_t. \tag{15.5.12}$$

Suppose that the model is also estimated by multivariate LS based on a sample
of size $T$. The estimators are denoted by $\hat\Pi$ and $\hat\Gamma_j$ and the residuals are
signified as $\hat u_t(n_T)$. Using this notation,

$$\hat\Sigma_u = \frac{1}{T}\sum_{t=1}^{T}\hat u_t(n_T)\hat u_t(n_T)' \tag{15.5.13}$$

is an estimator of the white noise covariance matrix $\Sigma_u$. The loading matrix
$\alpha$ may be estimated as $\hat\alpha = \hat\Pi_1$, where the latter matrix consists of the first
$r$ columns of $\hat\Pi$, as in the EGLS procedure presented in Section 7.2.2. As in
that procedure, the matrix $\beta'_{(K-r)}$ may be estimated as

$$\hat\beta'_{(K-r)} = (\hat\alpha'\hat\Sigma_u^{-1}\hat\alpha)^{-1}\hat\alpha'\hat\Sigma_u^{-1}\hat\Pi_2, \tag{15.5.14}$$

where $\hat\Pi_2$ consists of the last $K - r$ columns of $\hat\Pi$ (see Remark 4 of Section
7.2.2). The next proposition summarizes the asymptotic properties of the estimators.
Proofs can be found in Saikkonen (1992) and Saikkonen & Lütkepohl
(1996).

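Proposition 15.6 (Asymptotic Distribution of VECM Estimators)

Under the conditions of Proposition 15.5,

$$T(\hat\beta'_{(K-r)} - \beta'_{(K-r)}) \xrightarrow{d} \Big(\int_0^1 W^{\#}_{K-r}\,dW^{*\prime}_r\Big)'\Big(\int_0^1 W^{\#}_{K-r}W^{\#\prime}_{K-r}\,ds\Big)^{-1}, \tag{15.5.15}$$

where $W^{\#}_{K-r}$ and $W^{*}_r$ are independent $(K - r)$- and $r$-dimensional Wiener
processes, respectively, as in Proposition 7.2. Furthermore,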


\/T  -  nT  f  (nrYffinr)  -  Y(^t)]  _d 
[f inT)' (rnT  yECM  ®  Eu)f (tit)]1/2 


(15.5.16) 


where  y(ri)  :=  vec[a  :  Ti  :  •  •  •  :  Tn]  and  y(n)  :=  vec[a  :  Ti  :  •  •  •  :  Tn].  ■ 
Notice that the asymptotic distribution of the cointegration parameters
in (15.5.15) is the same as that of the corresponding ML estimator for Gaussian
finite order VAR processes, as discussed in Section 7.2 (see in particular
Proposition 7.2 and Remark 3 for Proposition 7.4). The loading and short-run
parameters have asymptotic properties similar to those of finite order
processes as well. Their asymptotic properties are the same as would be obtained
if the true $\beta$ matrix were known and used in the estimation procedure.

Moreover, Saikkonen & Lütkepohl (1996) showed that the white noise
covariance matrix estimator $\hat\Sigma_u$ also has asymptotic properties similar to those in
the finite order case. Furthermore, they stated slightly more general versions of
Propositions 15.5 and 15.6 and discussed how the results can be used in testing
hypotheses of parameter restrictions. In particular, they considered the case
of testing for Granger-causality. They also discussed adding an intercept term
to the model. Saikkonen & Lütkepohl (2000a) presented extensions which can
be used in deriving, for example, asymptotic properties of impulse responses
in the present framework.

In  practice,  the  cointegrating  rank  is  usually  unknown  and  has  to  be  de¬ 
termined  from  the  given  multiple  time  series.  How  to  do  so  in  the  present 
framework  of  an  infinite  order  process  is  discussed  next. 


15.5.3  Testing  for  the  Cointegrating  Rank 

In  Section  8.2,  we  have  discussed  testing  the  cointegrating  rank  of  a  finite 
order  VAR  process  by  considering  pairs  of  hypotheses  of  the  form 

$$H_0 : \mathrm{rk}(\Pi) = r_0 \quad \text{against} \quad H_1 : r_0 < \mathrm{rk}(\Pi) \le r_1. \tag{15.5.17}$$

In particular, the cases $r_1 = r_0 + 1$ and $r_1 = K$ were discussed and suitable
likelihood ratio tests were introduced. Suppose now that the test statistics
are computed in precisely the same way as in Section 8.2.1, based on the
VECM (15.5.12) with $n_T$ lagged differences of $y_t$. In other words, we compute
the statistic as if (15.5.12) were a Gaussian process with lag order $n_T$. To
emphasize the dependence on the lag order, we denote the test statistic
corresponding to the pair of hypotheses in (15.5.17) by $\lambda_{LR}^{(n_T)}(r_0, r_1)$. Lütkepohl
& Saikkonen (1999b) proved the following result.

Proposition 15.7 (Asymptotic Distributions of Tests for the Cointegrating Rank)

Suppose $y_t$ is generated by an infinite order process as described in Section
15.5.1. Moreover, suppose that

$$n_T \to \infty \quad \text{and} \quad n_T^3/T \to 0 \quad \text{as } T \to \infty. \tag{15.5.18}$$

Then $\lambda_{LR}^{(n_T)}(r_0, r_0 + 1)$ and $\lambda_{LR}^{(n_T)}(r_0, K)$ have the same limiting null
distributions as for a Gaussian finite order process given in Proposition 8.2. ■
Notice that in this proposition we just have an upper bound for the rate
at which the lag order $n_T$ has to go to infinity. No lower bound for the rate
of divergence is needed. In fact, Lütkepohl & Saikkonen (1999b) also considered
processes with nonzero mean term and, in addition, they treated the case
where the lag order is chosen by some model selection procedure instead of
a deterministic rule derived from (15.5.18). In summary, these results show
that as far as asymptotic theory is concerned, the cointegrating rank of an
I(1) process may be chosen on the basis of a finite order approximation rather
than a correctly specified model. This result is important not only because,
in practice, models are usually just approximations to the true DGP and,
hence, allowing explicitly for some approximation error is more realistic. It
is also important because we have proposed this approach for choosing the
cointegrating rank of a VARMA process in Chapter 14, Section 14.4.2.


15.6  Exercises 

Problem  15.1 

For the invertible MA(1) process $y_t = u_t + M u_{t-1}$ and $n = 1, 2$, determine
the  matrix  Pn  defined  in  (15.2.7). 

Problem  15.2 

Suppose the true data generation mechanism is a univariate AR(1) process,
$y_t = \alpha y_{t-1} + u_t$. Assume that a univariate AR(1) is indeed fitted to the data
and compare the resulting approximate forecast MSEs $\Sigma_{\hat y}(h)$ (given in Section
3.5) and $\Sigma_{\hat y}(h)$ (given in Section 15.3.1) for $h = 1, 2, \ldots$. (Hint: See Lütkepohl
(1987, pp. 76, 77).)

Problem  15.3 

Suppose the true data generation process is an invertible MA(1), as in Problem
15.1. Write down explicit expressions for the asymptotic covariance matrices
of $\hat\Phi_i(n_T)$, $\hat\Theta_i(n_T)$, $i = 1, 2$, and of $\hat\Psi_m(n_T)$, $\hat\Xi_m(n_T)$, $m = 1, 2$.

Problem 15.4

Let $\hat\Theta_i(n_T)$ and $\hat\Theta_i$ be estimators of the orthogonalized impulse responses $\Theta_i$
obtained under the conditions of Propositions 15.4 and 3.6, respectively. If
the true data generation mechanism is a finite order VAR($p$) process and the
actual process fitted to the data has order $n_T > p$, show that the asymptotic
covariance matrices in (15.4.3) and (3.7.8) are identical for $i = 0, 1$.

Problem  15.5 

Consider  the  investment/income/consumption  system  of  Section  15.3.2  and 
fit  a  VAR(4)  process  to  the  data. 

(a)  Determine  95%  interval  forecasts  for  all  three  variables  and  forecast  hori¬ 
zons  h  =  1,2,3  under  the  assumption  of  a  known  true  VAR  order  of  p  =  4 
and  under  the  assumption  of  an  infinite  order  true  generation  process. 
(b) Determine the $\Phi_i$ and $\Theta_i$ impulse responses and their asymptotic standard
errors for $i = 1, 2, 3, 4$ under both the assumption of a finite and an infinite
true VAR order. Compare the estimated standard errors obtained under
the two alternative scenarios for all variables.

Problem  15.6 

Show  that  the  relation  in  (15.5.6)  holds. 

Problem  15.7 

Derive the model representation (15.5.7). (Hint: See Saikkonen (1992, Section
2).)
16 Multivariate ARCH and GARCH Models


16.1  Background 

In  the  previous  chapters,  we  have  discussed  modelling  the  conditional  mean 
of  the  data  generation  process  of  a  multiple  time  series,  conditional  on  the 
past  at  each  particular  time  point.  In  that  context,  the  variance  or  covariance 
matrix  of  the  conditional  distribution  was  assumed  to  be  time  invariant.  In 
fact,  in  much  of  the  discussion,  the  residuals  or  forecast  errors  were  assumed 
to  be  independent  white  noise.  Such  a  simplification  is  useful  and  justified  in 
many  applications. 

There are also situations, however, when such an assumption is problematic,
for instance, when financial time series are being analyzed. To see this,
consider the monthly returns of the DAX (German stock index) for the period
1965-1995 depicted in Figure 16.1. The autocorrelations are all within the
$\pm 2/\sqrt{T}$ band and, hence, in accordance with the results discussed in Chapter 4,
Section 4.4, one may conclude that the returns are not autocorrelated.
If they were not only uncorrelated but also independent, then their squares
would be independent too. That this is not the case is clearly seen in the third
panel of Figure 16.1, where the autocorrelations of the squared returns
are also given. Consequently, in this case, assuming independent observations or,
equivalently, independent residuals in the AR(0) model $y_t = \nu + u_t$ is clearly
problematic. Because we have used the independence assumption in deriving
the $\pm 2/\sqrt{T}$ confidence bounds in Chapter 4, Section 4.4, the conclusion of
uncorrelated returns may also be questioned in this case.
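The contrast between uncorrelated levels and correlated squares can be reproduced with simulated data. The following is only a minimal sketch: it uses an artificial ARCH(1) series rather than the DAX returns, and the function names `sample_autocorr` and `simulate_arch1` as well as all parameter values are assumptions made for illustration.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / denom for k in range(1, max_lag + 1)])

def simulate_arch1(T, gamma0, gamma1, seed=0, burn=500):
    """Simulate u_t = sigma_t * eps_t with sigma_t^2 = gamma0 + gamma1 * u_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T + burn)
    u = np.zeros(T + burn)
    for t in range(1, T + burn):
        u[t] = np.sqrt(gamma0 + gamma1 * u[t - 1] ** 2) * eps[t]
    return u[burn:]  # drop the burn-in so the start value has little influence

T = 5000
u = simulate_arch1(T, gamma0=0.5, gamma1=0.5)  # illustrative parameter values
band = 2.0 / np.sqrt(T)                        # the +-2/sqrt(T) band from the text
acf_levels = sample_autocorr(u, 12)
acf_squares = sample_autocorr(u ** 2, 12)

print("band:", round(band, 4))
print("lag-1 autocorrelation of u_t:  ", round(acf_levels[0], 4))
print("lag-1 autocorrelation of u_t^2:", round(acf_squares[0], 4))
```

In such a simulation the autocorrelations of the squares typically lie far outside the band, whereas those of the levels stay small. As noted in the text, the band itself is derived under independence, so it is only a rough guide for the levels of an ARCH series.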

The  correlations  in  the  squares  of  the  DAX  returns  shown  in  Figure  16.1 
indicate  that  there  is  conditional  heteroskedasticity.  With  a  little  imagina¬ 
tion,  it  can  also  be  seen  in  the  figure  that  the  volatility  in  the  DAX  returns 
changes  over  time.  It  is  lower  in  the  first  half  of  the  sample  period  than  in 
the  second  half.  Similar  characteristics  in  many  time  series,  in  particular  in 
financial  market  series,  have  motivated  the  development  of  specific  models  for 
conditionally heteroskedastic data.




Fig.  16.1.  Monthly  DAX  returns  for  the  years  1965-1995  with  autocorrelations  and 
autocorrelations  of  squared  returns. 
It may be tempting to argue that the conditional mean is the optimal forecast
and, hence, changes in volatility are of less importance from a forecasting
point  of  view.  This  position  ignores,  however,  that  the  forecast  error  variances, 
that  is,  the  variances  of  the  conditional  distributions  are  needed  for  setting 
up  forecast  intervals.  Taking  into  account  conditional  heteroskedasticity  is 
therefore  important  also  when  forecasts  of  the  variables  under  investigation 
are  desired.  Moreover,  for  example  in  financial  analysis,  forecasts  of  the  fu¬ 
ture  volatility  of  a  series  under  consideration  are  often  of  interest  to  assess 
the  risk  associated  with  certain  assets.  In  that  case,  variance  forecasts  are  of 
direct  interest,  of  course.  Furthermore,  the  volatility  in  a  market  and,  hence, 
the  risk  associated  with  investments  in  a  particular  market  may  have  a  direct 
effect  on  the  expectations  of  the  market  participants.  Hence,  there  may  be  a 
feedback  from  the  second  to  the  first  moments.  Therefore,  the  emphasis  on  a 
more detailed modelling of the volatility of time series was a natural development
which was boosted by Engle’s (1982) invention of ARCH (autoregressive
conditional heteroskedasticity) models. By now the acronym ARCH stands for
a  wide  range  of  models  for  changing  conditional  volatility.  Moreover,  there  is 
also  some  literature  on  multivariate  extensions  which  are  the  central  topic  of 
this  chapter. 

Because many series have a close relationship, it is natural to conjecture
that an increase, say, in the volatility of one series may have an impact on the
volatility  of  another  series  as  well.  For  example,  this  may  occur  in  exchange 
rates  of  different  currencies,  in  interest  rates  for  bonds  of  different  times  to 
maturity,  or  in  returns  on  stocks  in  a  specific  segment  of  the  market.  Therefore, 
multivariate  models  for  conditional  heteroskedasticity  are  of  interest. 

In  the  following,  a  brief  review  of  some  facts  on  univariate  ARCH  and 
generalized  ARCH  (GARCH)  models  is  given  and  then  multivariate  extensions 
will  be  discussed.  Part  of  this  chapter  reports  results  from  an  article  by  Engle 
&  Kroner  (1995).  There  are  also  a  number  of  review  articles  which  cover 
multivariate  ARCH  and  GARCH  models  among  other  things.  Examples  are 
Bollerslev,  Engle  &  Nelson  (1994),  Bera  &  Higgins  (1993),  Bauwens,  Laurent 
&  Rombouts  (2004),  Bollerslev,  Chou  &  Kroner  (1992),  and  Pagan  (1996). 
The  latter  two  articles  also  survey  some  of  the  applied  literature. 


16.2  Univariate  GARCH  Models 

16.2.1  Definitions 

Consider the univariate serially uncorrelated, zero mean process $u_t$. For
instance, $u_t$ may represent the residuals of an autoregressive process. The
$u_t$ are said to follow an autoregressive conditionally heteroskedastic process
of order $q$ (ARCH($q$)) if the conditional distribution of $u_t$, given its past
$\Omega_{t-1} := \{u_{t-1}, u_{t-2}, \ldots\}$, has zero mean and the conditional variance is

$$\sigma^2_{t|t-1} := \mathrm{Var}(u_t|\Omega_{t-1}) = E(u_t^2|\Omega_{t-1}) = \gamma_0 + \gamma_1 u_{t-1}^2 + \cdots + \gamma_q u_{t-q}^2, \tag{16.2.1}$$

that is, $u_t|\Omega_{t-1} \sim (0, \sigma^2_{t|t-1})$. Another, sometimes quite useful way to define
an ARCH process is to specify

$$u_t = \sigma_{t|t-1}\varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0, 1). \tag{16.2.2}$$

Here the i.i.d. assumption for $\varepsilon_t$ is slightly more restrictive than the previous
definition which makes statements about the first two moments of the conditional
distribution only. In the following, the definition (16.2.2) will be used.
The $u_t$'s, generated in this way, will be serially uncorrelated with mean zero.

Originally, Engle (1982), in his seminal paper on ARCH models, assumed
the conditional distribution to be normal so that

$$\varepsilon_t \sim \text{i.i.d. } \mathcal{N}(0, 1) \quad \text{and} \quad u_t|\Omega_{t-1} \sim \mathcal{N}(0, \sigma^2_{t|t-1}). \tag{16.2.3}$$

Although different distributions were considered later as well, even with this
special distributional assumption the model is capable of generating series
with characteristics similar to those of many observed time series. In particular,
it is capable of generating series with volatility clustering and outliers
similar to the DAX series in Figure 16.1. Even if the conditional distribution
underlying an ARCH($q$) model is normal, the unconditional distribution will
generally be nonnormal. In particular, it is leptokurtic, that is, it has more
mass around zero and in the tails than the normal distribution and, hence, it
can produce occasional outliers.
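The leptokurtosis can be made explicit in the simplest case. The following worked derivation is a sketch for a stationary ARCH(1) process with conditionally normal errors, under the additional assumption (not stated in the text) that the fourth moment exists, which requires $3\gamma_1^2 < 1$. It uses the independence of $\varepsilon_t$ and $\sigma_{t|t-1}$, $E(\varepsilon_t^4) = 3$, and the unconditional variance $\sigma_u^2 = \gamma_0/(1-\gamma_1)$.

```latex
\begin{aligned}
E(u_t^4) &= E(\sigma_{t|t-1}^4)\,E(\varepsilon_t^4)
          = 3\,E\big[(\gamma_0 + \gamma_1 u_{t-1}^2)^2\big] \\
         &= 3\big[\gamma_0^2 + 2\gamma_0\gamma_1\sigma_u^2 + \gamma_1^2 E(u_t^4)\big],
          \qquad \sigma_u^2 = \frac{\gamma_0}{1-\gamma_1}, \\
E(u_t^4) &= \frac{3\gamma_0^2(1+\gamma_1)}{(1-\gamma_1)(1-3\gamma_1^2)}, \\
\frac{E(u_t^4)}{\sigma_u^4} &= \frac{3(1-\gamma_1^2)}{1-3\gamma_1^2} > 3
          \quad \text{for } 0 < \gamma_1 < 1/\sqrt{3}.
\end{aligned}
```

For example, $\gamma_1 = 0.5$ gives a kurtosis of $3 \cdot 0.75/0.25 = 9$, compared with 3 for the normal distribution.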

It turns out, however, that, for many series, ARCH processes with fairly
large orders are necessary to capture the dynamics in the conditional variances.
Therefore, Bollerslev (1986) and Taylor (1986) proposed to gain greater
parsimony by extending the model in a similar manner as the AR model
when moving to mixed ARMA models. They suggested the generalized ARCH
(GARCH) model with conditional variances given by

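$$\sigma^2_{t|t-1} = \gamma_0 + \gamma_1 u^2_{t-1} + \cdots + \gamma_q u^2_{t-q} + \beta_1\sigma^2_{t-1|t-2} + \cdots + \beta_m\sigma^2_{t-m|t-m-1}. \tag{16.2.4}$$

These models are briefly denoted by GARCH($q, m$). They generate processes
with existing unconditional variance if and only if the coefficient sum

$$\gamma_1 + \cdots + \gamma_q + \beta_1 + \cdots + \beta_m < 1. \tag{16.2.5}$$

If this condition is satisfied, $u_t$ has a constant unconditional variance given
by

$$\sigma_u^2 = \frac{\gamma_0}{1 - \gamma_1 - \cdots - \gamma_q - \beta_1 - \cdots - \beta_m}. \tag{16.2.6}$$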
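Condition (16.2.5) and formula (16.2.6) translate directly into a small helper. This is a minimal sketch; the function name `garch_unconditional_variance` and the numerical values in the usage lines are assumptions chosen for illustration only.

```python
def garch_unconditional_variance(gamma0, gammas, betas):
    """Unconditional variance of a GARCH(q, m) process, cf. (16.2.6).

    gammas holds gamma_1, ..., gamma_q and betas holds beta_1, ..., beta_m.
    A ValueError is raised if the coefficient sum violates (16.2.5).
    """
    coefficient_sum = sum(gammas) + sum(betas)
    if coefficient_sum >= 1.0:
        raise ValueError("coefficient sum must be below one, cf. (16.2.5)")
    return gamma0 / (1.0 - coefficient_sum)

# Hypothetical GARCH(1,1) with gamma0 = 0.1, gamma1 = 0.2, beta1 = 0.7.
variance = garch_unconditional_variance(0.1, [0.2], [0.7])
print(variance)
```

With these values the coefficient sum is 0.9, so the unconditional variance is 0.1/0.1, that is, one up to floating point rounding.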


The similarity of GARCH models and ARMA models for the conditional
mean can be seen by defining $v_t := u_t^2 - \sigma^2_{t|t-1}$, substituting $u_t^2 - v_t$ for $\sigma^2_{t|t-1}$
in (16.2.4) and rearranging terms. Thereby we get

$$u_t^2 = \gamma_0 + (\beta_1 + \gamma_1)u^2_{t-1} + \cdots + (\beta_q + \gamma_q)u^2_{t-q} + v_t - \beta_1 v_{t-1} - \cdots - \beta_m v_{t-m} \tag{16.2.7}$$

which is formally an ARMA($q, m$) model for $u_t^2$. Here it is assumed without
loss of generality that $q \ge m$ and $\beta_j := 0$ for $j > m$.
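The rearrangement leading to (16.2.7) is an exact identity, which can be confirmed numerically. The sketch below simulates a GARCH(1,1) path with hypothetical coefficients (an assumption for illustration) and checks that both sides of (16.2.7) agree at every time point.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma0, gamma1, beta1 = 0.1, 0.2, 0.7  # hypothetical GARCH(1,1) coefficients
T = 200

sigma2 = np.empty(T)  # conditional variances sigma^2_{t|t-1}
u = np.empty(T)
sigma2[0] = gamma0 / (1.0 - gamma1 - beta1)  # start at the unconditional variance
u[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
for t in range(1, T):
    sigma2[t] = gamma0 + gamma1 * u[t - 1] ** 2 + beta1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

v = u ** 2 - sigma2  # v_t := u_t^2 - sigma^2_{t|t-1}

# ARMA(1,1) form of (16.2.7): u_t^2 = gamma0 + (beta1 + gamma1) u_{t-1}^2 + v_t - beta1 v_{t-1}
lhs = u[1:] ** 2
rhs = gamma0 + (beta1 + gamma1) * u[:-1] ** 2 + v[1:] - beta1 * v[:-1]
max_gap = np.max(np.abs(lhs - rhs))
print("largest discrepancy:", max_gap)
```

The discrepancy is zero up to rounding error for any coefficient values, because (16.2.7) is only a rewriting of (16.2.4).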


16.2.2  Forecasting 

Although the conditional expectation of the process $u_t$ given $\Omega_{t-h}$ is zero and,
hence, the optimal $h$-step forecasts are all zero for $h = 1, 2, \ldots$, there is an
important difference from the situation where the $u_t$ are an independent white
noise process. If the $u_t$ are Gaussian $\mathcal{N}(0, \sigma_u^2)$, a 1-step ahead $(1 - \alpha)100\%$
forecast interval has the form

$$u_t(1) \pm c_{1-\alpha/2}\sigma_u,$$

where $u_t(1)$ denotes the forecast, as usual, and $c_{1-\alpha/2}$ is the relevant $1 - \alpha/2$
percentage point of the normal distribution (see Section 2.2.3). Thus, the
forecast intervals are of constant width, regardless of the forecast origin $t$. In
contrast, if $u_t$ is a GARCH($q, m$) process, the correct 1-step ahead $(1 - \alpha)100\%$
forecast interval is

$$u_t(1) \pm c_{1-\alpha/2}\sigma_{t+1|t}, \tag{16.2.8}$$

where the length depends on the history of the process because the conditional
standard deviation, $\sigma_{t+1|t}$, varies over time.

To  illustrate  this  phenomenon,  suppose  the  mean-adjusted  DAX  returns 
were  generated  by  a  GARCH(1,1)  model  with  conditionally  normal  compo¬ 
nents  and  conditional  variances 


$$\sigma^2_{t|t-1} = 0.0003 + 0.120u^2_{t-1} + 0.771\sigma^2_{t-1|t-2}.$$

This model was actually fitted to the monthly DAX returns by Lütkepohl
(1997) for the period 1960-1991. The 1-step ahead 95% forecast intervals are
shown in Figure 16.2. The unconditional variance is in this case

$$\sigma_u^2 = \frac{\gamma_0}{1 - \gamma_1 - \beta_1} = \frac{0.0003}{1 - 0.120 - 0.771} = 2.75 \times 10^{-3}.$$

Assuming mistakenly that the data are i.i.d. normal and using the foregoing
white noise variance results in the constant forecast intervals also shown in
Figure  16.2.  It  is  important  to  note  the  implications  of  these  results.  The 
constant  intervals  completely  ignore  the  variations  in  volatility,  whereas  the 
GARCH  intervals  clearly  reflect  the  greater  forecast  uncertainty  in  times  of 
high  volatility  and  are  narrower  in  times  where  the  stock  market  is  less  volatile. 
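The difference between the two kinds of intervals can be sketched with the coefficient estimates quoted above. The return values in `u` below are made up for illustration and are not the DAX data; the function name `conditional_variances` and the choice of the unconditional variance as start value are likewise assumptions.

```python
import numpy as np

GAMMA0, GAMMA1, BETA1 = 0.0003, 0.120, 0.771  # estimates quoted in the text
C_975 = 1.959964                               # 97.5% point of the standard normal

def conditional_variances(u, start):
    """Return s with s[0] = start and s[t+1] = GAMMA0 + GAMMA1*u[t]^2 + BETA1*s[t].

    s[t + 1] is the variance used for the 1-step forecast made after observing u[t].
    """
    u = np.asarray(u, dtype=float)
    s = np.empty(len(u) + 1)
    s[0] = start
    for t in range(len(u)):
        s[t + 1] = GAMMA0 + GAMMA1 * u[t] ** 2 + BETA1 * s[t]
    return s

sigma2_u = GAMMA0 / (1.0 - GAMMA1 - BETA1)        # unconditional variance
u = np.array([0.01, -0.02, 0.15, -0.01, 0.005])   # made-up mean-adjusted returns
s = conditional_variances(u, start=sigma2_u)

garch_half_width = C_975 * np.sqrt(s)         # varies with the forecast origin
const_half_width = C_975 * np.sqrt(sigma2_u)  # identical for every origin

print("constant half width:", round(const_half_width, 4))
print("GARCH half widths:  ", np.round(garch_half_width, 4))
```

After the two small returns the GARCH interval is narrower than the constant one, and after the large return of 0.15 it widens well beyond it, which is the pattern described for Figure 16.2.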

As  mentioned  earlier,  if  the  residuals  follow  a  normal  GARCH  process,  the 
unconditional  distribution  of  the  observations  will  generally  be  nonnormal. 




Fig.  16.2.  95%  1-step  ahead  forecast  intervals  for  the  DAX  returns  obtained  under 
GARCH  ( - )  and  constant  ( — )  variance  assumptions. 


Hence,  the  constant  forecast  intervals  which  have  been  computed  under  nor¬ 
mality  assumptions  may  not  have  the  desired  95%  probability  content  because 
of  the  false  distributional  assumption.  The  nonnormal  unconditional  distribu¬ 
tion  of  GARCH  processes  also  complicates  multi-step  interval  forecasting. 
Formulas  and  properties  of  multi-step  forecasts  were  discussed  by  Baillie  & 
Bollerslev  (1992).  Without  going  into  details,  it  may  be  worth  noting  that  for 
a  stationary  process,  when  the  forecast  horizon  increases,  the  optimal  forecast 
will  always  approach  the  process  mean  with  the  unconditional  variance  being 
the  forecast  error  variance  and  the  forecast  error  distribution  approaching  the 
unconditional  process  distribution,  which  will  generally  be  nonnormal  if  the 
conditional  distribution  is  normal. 
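For a GARCH(1,1) process the convergence of the forecast error variance to the unconditional variance can be traced with a simple recursion. The sketch below relies on the standard result that, for $h \ge 2$, the conditional expectation of $\sigma^2_{t+h|t+h-1}$ given $\Omega_t$ equals $\gamma_0 + (\gamma_1 + \beta_1)$ times the corresponding quantity for horizon $h - 1$. The function name and the start value are assumptions made for illustration.

```python
def garch11_variance_forecasts(sigma2_next, gamma0, gamma1, beta1, horizon):
    """Forecast error variances for horizons 1, ..., horizon of a GARCH(1,1).

    sigma2_next is the 1-step conditional variance sigma^2_{t+1|t}.
    """
    forecasts = [sigma2_next]
    for _ in range(horizon - 1):
        forecasts.append(gamma0 + (gamma1 + beta1) * forecasts[-1])
    return forecasts

gamma0, gamma1, beta1 = 0.0003, 0.120, 0.771  # estimates quoted in Section 16.2.2
sigma2_u = gamma0 / (1.0 - gamma1 - beta1)
path = garch11_variance_forecasts(0.0047, gamma0, gamma1, beta1, horizon=200)

print("h = 1:  ", path[0])
print("h = 12: ", path[11])
print("h = 200:", path[-1], "unconditional:", sigma2_u)
```

Starting from a high-volatility origin, the variance forecasts decline geometrically at rate $\gamma_1 + \beta_1 = 0.891$ towards $\sigma_u^2$. The sketch covers the variances only; the shape of the multi-step forecast error distribution is a separate matter, as noted above.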

We  will  now  discuss  how  to  extend  these  concepts  to  the  case  of  vector 
processes.  In  that  context,  we  will  also  address  the  issue  of  estimating  the 
parameters  of  a  GARCH  model. 


16.3  Multivariate  GARCH  Models 

Multivariate extensions of ARCH and GARCH models may be defined in
principle similarly to VAR and VARMA models. Early articles on multivariate
ARCH and GARCH models are Engle, Granger & Kraft (1986), Diebold &
Nerlove (1989), and Bollerslev, Engle & Wooldridge (1988). There are a number of
complications in analyzing and estimating such models which will be discussed
now. The simpler multivariate ARCH models will be considered first.


16.3.1  Multivariate  ARCH 


Suppose that $u_t = (u_{1t}, \ldots, u_{Kt})'$ is a $K$-dimensional zero mean, serially
uncorrelated process which may be the residual process of some dynamic model
and which can be represented as

$$u_t = \Sigma^{1/2}_{t|t-1}\varepsilon_t, \tag{16.3.1}$$

where $\varepsilon_t$ is $K$-dimensional i.i.d. white noise, $\varepsilon_t \sim \text{i.i.d.}(0, I_K)$, and $\Sigma_{t|t-1}$ is
the conditional covariance matrix of $u_t$, given $u_{t-1}, u_{t-2}, \ldots$. As usual, $\Sigma^{1/2}_{t|t-1}$
is the symmetric positive definite square root of $\Sigma_{t|t-1}$ (see Appendix A.9.2
for details on the square root of a positive definite matrix). Obviously, the $u_t$'s
have a conditional distribution, given $\Omega_{t-1} := \{u_{t-1}, u_{t-2}, \ldots\}$, of the form

$$u_t|\Omega_{t-1} \sim (0, \Sigma_{t|t-1}). \tag{16.3.2}$$

They represent a multivariate ARCH($q$) process if

$$\mathrm{vech}(\Sigma_{t|t-1}) = \gamma_0 + \Gamma_1\mathrm{vech}(u_{t-1}u'_{t-1}) + \cdots + \Gamma_q\mathrm{vech}(u_{t-q}u'_{t-q}), \tag{16.3.3}$$

where vech again denotes the half-vectorization operator which stacks the
columns of a square matrix from the diagonal downwards in a vector, $\gamma_0$ is a
$\frac{1}{2}K(K + 1)$-dimensional vector of constants and the $\Gamma_j$'s are $(\frac{1}{2}K(K + 1) \times
\frac{1}{2}K(K + 1))$ coefficient matrices. Different conditional distributions have been
assumed and analyzed. For example, a multivariate normal conditional distribution
may be considered, i.e., $\varepsilon_t \sim \mathcal{N}(0, I_K)$, so that $u_t|\Omega_{t-1} \sim \mathcal{N}(0, \Sigma_{t|t-1})$.
Although this distribution is perhaps not the most suitable one for many
financial time series, it will play a role when parameter estimation is discussed
in Section 16.4. Conditional distributions of processes representing financial
time series are often better represented by more heavy-tailed distributions
such as $t$-distributions with a small degrees of freedom parameter.

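As an example, consider a bivariate ($K = 2$) ARCH(1) process,

$$\mathrm{vech}\begin{bmatrix} \sigma_{11,t|t-1} & \sigma_{12,t|t-1} \\ \sigma_{12,t|t-1} & \sigma_{22,t|t-1} \end{bmatrix}
= \begin{bmatrix} \sigma_{11,t|t-1} \\ \sigma_{12,t|t-1} \\ \sigma_{22,t|t-1} \end{bmatrix}
= \begin{bmatrix} \gamma_{10} \\ \gamma_{20} \\ \gamma_{30} \end{bmatrix}
+ \begin{bmatrix} \gamma_{11} & \gamma_{12} & \gamma_{13} \\ \gamma_{21} & \gamma_{22} & \gamma_{23} \\ \gamma_{31} & \gamma_{32} & \gamma_{33} \end{bmatrix}
\begin{bmatrix} u^2_{1,t-1} \\ u_{1,t-1}u_{2,t-1} \\ u^2_{2,t-1} \end{bmatrix}.$$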
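The bookkeeping behind the vech form may be easier to follow in code. The sketch below implements the half-vectorization, its inverse for symmetric matrices, and one evaluation of the ARCH(1) equation. The helper names `vech`, `unvech` and `arch1_covariance` and all numerical values are assumptions for illustration.

```python
import numpy as np

def vech(a):
    """Stack the columns of a square matrix from the diagonal downwards."""
    a = np.asarray(a, dtype=float)
    k = a.shape[0]
    return np.concatenate([a[j:, j] for j in range(k)])

def unvech(v, k):
    """Rebuild the symmetric (k x k) matrix whose half-vectorization is v."""
    s = np.zeros((k, k))
    pos = 0
    for j in range(k):
        s[j:, j] = v[pos:pos + k - j]   # column j from the diagonal downwards
        s[j, j:] = v[pos:pos + k - j]   # mirror into row j
        pos += k - j
    return s

def arch1_covariance(gamma0, Gamma1, u_prev):
    """Conditional covariance matrix of a vech ARCH(1) process."""
    u_prev = np.asarray(u_prev, dtype=float)
    v = gamma0 + Gamma1 @ vech(np.outer(u_prev, u_prev))
    return unvech(v, len(u_prev))

# Hypothetical bivariate example with a diagonal coefficient matrix.
gamma0 = np.array([1.0, 0.2, 1.0])
Gamma1 = np.diag([0.3, 0.1, 0.3])
Sigma = arch1_covariance(gamma0, Gamma1, u_prev=[1.0, -2.0])
print(Sigma)
```

Here vech of the outer product of the lagged residuals is (1, -2, 4)', so the conditional variances are 1.3 and 2.2 and the conditional covariance is 0.2 + 0.1(-2) = 0. Nothing in the vech form itself forces the resulting matrix to be positive definite.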

Obviously, even this simple model for a bivariate series has a fair number of
parameters which makes it difficult to handle. In particular, the implications
of a general model of this type for the relationships between the variables
and their higher order moment properties are not obvious. Therefore, more
restricted models have been proposed. For instance, Bollerslev et al. (1988)
considered diagonal ARCH processes where the $\Gamma_j$ matrices are all diagonal.
In the first order case, the model has the form

$$\begin{bmatrix} \sigma_{11,t|t-1} \\ \sigma_{12,t|t-1} \\ \sigma_{22,t|t-1} \end{bmatrix}
= \begin{bmatrix} \gamma_{10} \\ \gamma_{20} \\ \gamma_{30} \end{bmatrix}
+ \begin{bmatrix} \gamma_{11} & 0 & 0 \\ 0 & \gamma_{22} & 0 \\ 0 & 0 & \gamma_{33} \end{bmatrix}
\begin{bmatrix} u^2_{1,t-1} \\ u_{1,t-1}u_{2,t-1} \\ u^2_{2,t-1} \end{bmatrix}.$$

Even  simple  processes  of  this  type  can  generate  rich  volatility  dynamics. 
Still,  despite  their  simpler  structure,  processes  of  this  type  involve  nontrivial 
technical  problems.  One  of  them  is  that  the  parameters  have  to  be  such  that 
the conditional covariance matrices $\Sigma_{t|t-1}$ are all positive definite. To
guarantee this property, Baba, Engle, Kraft & Kroner (1990) and Engle & Kroner
(1995) investigated the following variant of a multivariate ARCH model,

$$\Sigma_{t|t-1} = \Gamma_0^* + \Gamma_1^{*\prime}u_{t-1}u'_{t-1}\Gamma_1^* + \cdots + \Gamma_q^{*\prime}u_{t-q}u'_{t-q}\Gamma_q^*, \tag{16.3.4}$$

where the $\Gamma_j^*$'s are each ($K \times K$) matrices. This particular multivariate model
has been christened the BEKK model. Here the $\Sigma_{t|t-1}$ are positive definite if $\Gamma_0^*$
has this property, which may be enforced by writing it in a product form,
$\Gamma_0^* = C_0^{*\prime}C_0^*$ with triangular $C_0^*$ matrix. Another advantage of this model is
that  it  is  relatively  parsimonious.  For  instance,  for  a  bivariate  process  with 
K  =  2  and  q  =  1,  there  are  only  7  parameters,  whereas  the  full  model  has  12 
coefficients.  Moreover,  in  contrast  to  the  diagonal  model,  it  can  produce  quite 
rich  interactions  between  the  conditional  second  order  moments. 
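A small numerical sketch may help to see why the BEKK form delivers positive definite matrices and where the parameter counts come from. The matrices `C0` and `Gamma1` below are hypothetical values, and the function names are assumptions for illustration.

```python
import numpy as np

def bekk_arch1_covariance(C0, Gamma1, u_prev):
    """Sigma_{t|t-1} = C0'C0 + Gamma1' u u' Gamma1 for a BEKK ARCH(1) model."""
    u_prev = np.asarray(u_prev, dtype=float).reshape(-1, 1)
    return C0.T @ C0 + Gamma1.T @ u_prev @ u_prev.T @ Gamma1

def parameter_counts(K):
    """Number of free parameters in the BEKK ARCH(1) and the full vech ARCH(1)."""
    n_half = K * (K + 1) // 2        # elements of a triangular (K x K) matrix
    bekk = n_half + K * K            # triangular C0 plus one (K x K) matrix
    full = n_half + n_half ** 2      # gamma_0 plus the unrestricted Gamma_1
    return bekk, full

C0 = np.array([[1.0, 0.5],
               [0.0, 0.8]])          # triangular with nonzero diagonal
Gamma1 = np.array([[0.3, 0.1],
                   [-0.2, 0.4]])
Sigma = bekk_arch1_covariance(C0, Gamma1, u_prev=[1.5, -0.7])

print(Sigma)
print("smallest eigenvalue:", np.linalg.eigvalsh(Sigma).min())
print("parameter counts for K = 2:", parameter_counts(2))
```

The first term is positive definite whenever the triangular matrix has a nonzero diagonal, and the second term is a positive semidefinite rank one matrix, so their sum is positive definite for every value of the lagged residual vector. The counts for $K = 2$ reproduce the 7 and 12 parameters mentioned in the text.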

16.3.2  MGARCH 

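In principle, multivariate ARCH models may be generalized in the same way
as in the univariate case. In the multivariate GARCH (MGARCH) model for
$u_t$, the conditional covariance matrices have the form

$$\mathrm{vech}(\Sigma_{t|t-1}) = \gamma_0 + \sum_{j=1}^{q}\Gamma_j\mathrm{vech}(u_{t-j}u'_{t-j}) + \sum_{j=1}^{m}G_j\mathrm{vech}(\Sigma_{t-j|t-j-1}), \tag{16.3.5}$$

where the $G_j$'s are also fixed $(\frac{1}{2}K(K + 1) \times \frac{1}{2}K(K + 1))$ coefficient matrices.
For example, for a bivariate GARCH(1,1) model,

$$\mathrm{vech}\begin{bmatrix} \sigma_{11,t|t-1} & \sigma_{12,t|t-1} \\ \sigma_{12,t|t-1} & \sigma_{22,t|t-1} \end{bmatrix}
= \begin{bmatrix} \sigma_{11,t|t-1} \\ \sigma_{12,t|t-1} \\ \sigma_{22,t|t-1} \end{bmatrix}
= \begin{bmatrix} \gamma_{10} \\ \gamma_{20} \\ \gamma_{30} \end{bmatrix}
+ \begin{bmatrix} \gamma_{11} & \gamma_{12} & \gamma_{13} \\ \gamma_{21} & \gamma_{22} & \gamma_{23} \\ \gamma_{31} & \gamma_{32} & \gamma_{33} \end{bmatrix}
\begin{bmatrix} u^2_{1,t-1} \\ u_{1,t-1}u_{2,t-1} \\ u^2_{2,t-1} \end{bmatrix}
+ \begin{bmatrix} g_{11} & g_{12} & g_{13} \\ g_{21} & g_{22} & g_{23} \\ g_{31} & g_{32} & g_{33} \end{bmatrix}
\begin{bmatrix} \sigma_{11,t-1|t-2} \\ \sigma_{12,t-1|t-2} \\ \sigma_{22,t-1|t-2} \end{bmatrix}.$$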

A VARMA representation of an MGARCH process may be obtained analogously
to the univariate case (see (16.2.7)) by defining $x_t := \mathrm{vech}(u_tu'_t)$ and
$v_t := x_t - \mathrm{vech}(\Sigma_{t|t-1})$. Using these specifications and substituting $x_t - v_t$
for $\mathrm{vech}(\Sigma_{t|t-1})$, (16.3.5) can be rewritten as

$$x_t = \gamma_0 + \sum_{j=1}^{\max(q,m)}(\Gamma_j + G_j)x_{t-j} + v_t - \sum_{j=1}^{m}G_jv_{t-j},$$

where $\Gamma_j = 0$ for $j > q$ and $G_j = 0$ for $j > m$. This representation is
occasionally useful in deriving properties of MGARCH processes (e.g., Section
16.6.1).

Engle & Kroner (1995) showed that the MGARCH process $u_t$ with conditional
covariances as given in (16.3.5) is stationary if and only if all eigenvalues
of the matrix

$$\sum_{j=1}^{q}\Gamma_j + \sum_{j=1}^{m}G_j \tag{16.3.6}$$

have modulus less than one.
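The eigenvalue condition is easy to check numerically once the coefficient matrices of the vech form are available. The following sketch uses hypothetical diagonal matrices for a bivariate GARCH(1,1); the function name `mgarch_spectral_radius` is an assumption for illustration.

```python
import numpy as np

def mgarch_spectral_radius(Gammas, Gs):
    """Largest modulus among the eigenvalues of sum_j Gamma_j + sum_j G_j, cf. (16.3.6)."""
    total = sum(np.asarray(a, dtype=float) for a in list(Gammas) + list(Gs))
    return float(np.max(np.abs(np.linalg.eigvals(total))))

# Hypothetical bivariate GARCH(1,1): the vech coefficient matrices are (3 x 3).
Gamma1 = np.diag([0.10, 0.05, 0.10])
G1_stationary = np.diag([0.80, 0.85, 0.80])
G1_explosive = np.diag([0.95, 0.85, 0.80])

rho_ok = mgarch_spectral_radius([Gamma1], [G1_stationary])
rho_bad = mgarch_spectral_radius([Gamma1], [G1_explosive])
print("spectral radius:", rho_ok, "stationary:", rho_ok < 1.0)
print("spectral radius:", rho_bad, "stationary:", rho_bad < 1.0)
```

In the diagonal case the eigenvalues are simply the sums of corresponding diagonal elements, so the first specification has spectral radius 0.9 and the second 1.05. For nondiagonal matrices the same function applies unchanged.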

The parameter space of an MGARCH model has a large dimension in general
and needs to be restricted to guarantee uniqueness of the representation
and to obtain suitable properties of the conditional covariances. To reduce the
parameter space, Bollerslev et al. (1988) discussed diagonal MGARCH models,
where the $\Gamma_j$'s and $G_j$'s in (16.3.5) are diagonal matrices. Alternatively,
a BEKK GARCH model of the following form may be useful:

$$\Sigma_{t|t-1} = C_0^{*\prime}C_0^* + \sum_{n=1}^{N}\sum_{j=1}^{q}\Gamma_{jn}^{*\prime}u_{t-j}u'_{t-j}\Gamma_{jn}^* + \sum_{n=1}^{N}\sum_{j=1}^{m}G_{jn}^{*\prime}\Sigma_{t-j|t-j-1}G_{jn}^*, \tag{16.3.7}$$

where again $C_0^*$ is a triangular ($K \times K$) matrix and the coefficient matrices $\Gamma_{jn}^*$,
$G_{jn}^*$ are also ($K \times K$). Given the similarity of MGARCH and VARMA models,
it is clear from Chapter 12, Section 12.1, that restrictions have to be imposed
on the coefficient matrices to ensure uniqueness of the parameterization. Engle
& Kroner (1995) gave the following properties of BEKK GARCH models
which also address the uniqueness problem:
(1) From the stationarity condition (16.3.6), the BEKK model is seen to be
stationary if all eigenvalues of the matrix

$$\sum_{n=1}^{N}\sum_{j=1}^{q}\Gamma_{jn}^*\otimes\Gamma_{jn}^* + \sum_{n=1}^{N}\sum_{j=1}^{m}G_{jn}^*\otimes G_{jn}^* \tag{16.3.8}$$

have modulus less than one.

(2)  The  BEKK  model  nests  all  positive  definite  diagonal  GARCH  models, 
that  is,  every  diagonal  GARCH  model  with  positive  definite  conditional 
covariance  matrices  has  a  BEKK  representation. 

(3) The BEKK model (16.3.7) generates positive definite covariance matrices
$\Sigma_{t|t-1}$ if $\Sigma_{0|-1}, \Sigma_{-1|-2}, \ldots, \Sigma_{-m+1|-m}$ are positive definite and if at least
one of the matrices $C_0^*$, $G_{jn}^*$, $j = 1, \ldots, m$, $n = 1, \ldots, N$, is nonsingular
(see Engle & Kroner (1995, Proposition 2.5)).

(4) In the class of BEKK GARCH(1,1) models with $N = 1$, the representation

$$\Sigma_{t|t-1} = C_0^{*\prime}C_0^* + \Gamma_{11}^{*\prime}u_{t-1}u'_{t-1}\Gamma_{11}^* + G_{11}^{*\prime}\Sigma_{t-1|t-2}G_{11}^*$$

is unique if all diagonal elements of $C_0^*$ are positive and $\gamma^*_{11,1}, g^*_{11,1} > 0$.
Here $\gamma^*_{11,1}$ and $g^*_{11,1}$ represent the upper left-hand elements of $\Gamma_{11}^*$ and
$G_{11}^*$, respectively.

(5) For a more general BEKK GARCH(1,1) model with

$$\Sigma_{t|t-1} = C_0^{*\prime}C_0^* + \sum_{n=1}^{N}\Gamma_{1n}^{*\prime}u_{t-1}u'_{t-1}\Gamma_{1n}^* + \sum_{n=1}^{N}G_{1n}^{*\prime}\Sigma_{t-1|t-2}G_{1n}^*,$$

uniqueness is achieved by the following restrictions:

(a) All diagonal elements of $C_0^*$ are positive.

(b) $\Gamma_{1n}^* = G_{1n}^* = 0$ for $n > K^2$.

(c) In the matrices $\Gamma_{1n_j}^*$ with $n_j = K(j - 1) + 1, \ldots, Kj$, and $j = 1, \ldots, K$,
the first $j - 1$ columns and the first $n_j - K(j - 1) - 1$ rows are zero.
Moreover, the lower right-hand element of $\Gamma_{1n_j}^*$ is positive, $\gamma^*_{KK,n_j} > 0$.

(d) Restrictions analogous to those for the $\Gamma_{1n}^*$ also hold for the $G_{1n}^*$.
(Engle & Kroner (1995, Proposition 2.3)).

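For illustrative purposes, suppose $K = 3$ so that $N = K^2 = 9$ and $n_1 =
1, 2, 3$; $n_2 = 4, 5, 6$; $n_3 = 7, 8, 9$. Hence, a unique representation is obtained if
the zero restrictions shown in the following matrices are imposed:

$$\Gamma_{11}^* = \begin{bmatrix} \gamma^*_{11,1} & \gamma^*_{12,1} & \gamma^*_{13,1} \\ \gamma^*_{21,1} & \gamma^*_{22,1} & \gamma^*_{23,1} \\ \gamma^*_{31,1} & \gamma^*_{32,1} & \gamma^*_{33,1} \end{bmatrix}, \quad
\Gamma_{12}^* = \begin{bmatrix} 0 & 0 & 0 \\ \gamma^*_{21,2} & \gamma^*_{22,2} & \gamma^*_{23,2} \\ \gamma^*_{31,2} & \gamma^*_{32,2} & \gamma^*_{33,2} \end{bmatrix}, \quad
\Gamma_{13}^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ \gamma^*_{31,3} & \gamma^*_{32,3} & \gamma^*_{33,3} \end{bmatrix},$$

$$\Gamma_{14}^* = \begin{bmatrix} 0 & \gamma^*_{12,4} & \gamma^*_{13,4} \\ 0 & \gamma^*_{22,4} & \gamma^*_{23,4} \\ 0 & \gamma^*_{32,4} & \gamma^*_{33,4} \end{bmatrix}, \quad
\Gamma_{15}^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & \gamma^*_{22,5} & \gamma^*_{23,5} \\ 0 & \gamma^*_{32,5} & \gamma^*_{33,5} \end{bmatrix}, \quad
\Gamma_{16}^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \gamma^*_{32,6} & \gamma^*_{33,6} \end{bmatrix},$$

$$\Gamma_{17}^* = \begin{bmatrix} 0 & 0 & \gamma^*_{13,7} \\ 0 & 0 & \gamma^*_{23,7} \\ 0 & 0 & \gamma^*_{33,7} \end{bmatrix}, \quad
\Gamma_{18}^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & \gamma^*_{23,8} \\ 0 & 0 & \gamma^*_{33,8} \end{bmatrix}, \quad
\Gamma_{19}^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \gamma^*_{33,9} \end{bmatrix}.$$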

The same zero restrictions are also imposed on the $G_{1n}^*$. Of course, in a specific
case,  there  may  be  further  zero  restrictions  on  the  coefficient  matrices.  In 
particular,  N  may  be  less  than  K2.  An  example  of  this  type  is  given  in  Section 
16.4.2. 
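The zero pattern required by restriction (5)(c) can be generated mechanically for any $K$. The sketch below builds a Boolean mask for each $n$; the function name `bekk_zero_mask` is an assumption for illustration, and the code encodes only the zero restrictions, not the positivity constraints.

```python
import numpy as np

def bekk_zero_mask(K, n):
    """Boolean (K x K) mask that is True where Gamma*_{1n} may be nonzero, n = 1, ..., K^2."""
    if not 1 <= n <= K * K:
        raise ValueError("n must lie between 1 and K^2")
    j = (n - 1) // K + 1             # block index with K(j-1)+1 <= n <= Kj
    zero_cols = j - 1                # the first j-1 columns are zero
    zero_rows = n - K * (j - 1) - 1  # the first n - K(j-1) - 1 rows are zero
    mask = np.ones((K, K), dtype=bool)
    mask[:, :zero_cols] = False
    mask[:zero_rows, :] = False
    return mask

K = 3
for n in (1, 5, 9):
    print("n =", n)
    print(bekk_zero_mask(K, n).astype(int))

free_total = sum(int(bekk_zero_mask(K, n).sum()) for n in range(1, K * K + 1))
print("free elements over all n:", free_total)
```

For $K = 3$ the masks reproduce the nine patterns of the illustration, and the number of unrestricted elements adds up to 36, which equals the number of elements of the $(6 \times 6)$ coefficient matrix $\Gamma_1$ in the vech form (16.3.5).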

Given the correspondence between GARCH and VARMA models, it should
be  clear  from  the  discussion  of  uniqueness  of  VARMA  representations  in  Chap¬ 
ter  12  that  a  unique  parameterization  of  a  multivariate  GARCH  representa¬ 
tion  is  not  a  trivial  matter.  Whether  the  constraints  given  here  are  the  most 
operational  ones  in  practice  remains  to  be  seen.  If  a  unique  representation  is 
set  up,  estimation  becomes  possible.  This  issue  will  be  discussed  in  Section 
16.4. 

16.3.3  Other  Multivariate  ARCH  and  GARCH  Models 

Although  the  BEKK  model  with  low  orders  may  be  a  relatively  parsimonious 
representation  of  the  conditional  covariance  structure  of  a  process,  the  number 
of  parameters  still  grows  quickly  with  the  dimension  of  the  underlying  system. 
Therefore,  in  practice,  it  is  only  feasible  if  systems  with  just  a  few  variables 
are  under  consideration  and  further  simplifications  were  proposed  to  alleviate 
modelling  of  higher  dimensional  processes.  Some  of  them  can  be  viewed  as 
special BEKK models. For example, Lin (1992) specified a factor GARCH
model, where the $\Gamma_{1n}^*$'s and $G_{1n}^*$'s in a BEKK GARCH(1,1) model are of the
form

$$\Gamma_{1n}^* = \gamma_n\eta_n\xi'_n \quad \text{and} \quad G_{1n}^* = g_n\eta_n\xi'_n, \quad n = 1, \ldots, N. \tag{16.3.9}$$

Here $\gamma_n$ and $g_n$ are scalars and $\eta_n$ and $\xi_n$ are ($K \times 1$) vectors satisfying
$\xi'_n\xi_n = 1$, $\eta'_n\xi_n = 1$ for $n = 1, \ldots, N$ and $\eta'_n\xi_k = 0$ for $n \ne k$. Thus, the $\Gamma_{1n}^*$'s
and $G_{1n}^*$'s all have rank 1.

In some proposals, the conditional covariance matrix has the form

$$\Sigma_{t|t-1} = QH_{t|t-1}Q', \tag{16.3.10}$$

where $Q$ is ($K \times K$) and does not depend on $t$, whereas $H_{t|t-1}$ is a positive
definite ($K \times K$) matrix which may depend on $t$. For example, Vrontos,
Dellaportas & Politis (2003) proposed to use a triangular matrix $Q$ and specified

$$H_{t|t-1} = \mathrm{diag}(\sigma^2_{1,t|t-1}, \ldots, \sigma^2_{K,t|t-1}) \tag{16.3.11}$$

to be a diagonal matrix with univariate GARCH conditional variances $\sigma^2_{k,t|t-1}$
on the diagonal. A closely related model, the so-called generalized orthogonal
GARCH model, was proposed by van der Weide (2002).

Clearly,  restricting  the  second  moment  dynamics  to  a  transformation  of 
univariate  GARCH  models  as  in  (16.3.11)  is  restrictive  and,  in  particular,  it 
limits  the  covariance  dynamics  in  a  potentially  undesired  way.  Therefore,  the 
alternative  specification 

$$H_{t|t-1} = D_t R_t D_t \tag{16.3.12}$$

was proposed, where restrictions of different forms are specified for the $(K \times K)$ matrices $D_t$ and $R_t$. For example, if $R_t = R$ is a time invariant correlation matrix and $D_t = \mathrm{diag}(\sigma_{1t|t-1}, \dots, \sigma_{Kt|t-1})$ is a diagonal matrix with time varying conditional standard deviations on the diagonal, Bollerslev's (1990) constant conditional correlation (CCC) MGARCH model is obtained. Clearly, in this model, the time invariant $R$ is the correlation matrix corresponding to the covariance matrix $\Sigma_{t|t-1}$ for all $t$. Engle (2002) extended the model by
allowing  for  richer  dynamics  and  proposed  the  so-called  dynamic  conditional 
correlation  (DCC)  model.  A  related  model  was  also  proposed  by  Tse  &  Tsui 
(2002). 
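
The CCC construction can be made concrete with a small numerical sketch. The Python fragment below forms the product $D_t R D_t$ for two hypothetical periods and confirms that the implied conditional correlation is the same in both, although the conditional variances differ. The function names and all numerical values are illustrative assumptions and are not taken from the text.

```python
import math

def ccc_covariance(cond_std, corr):
    """Return D R D, where D = diag(cond_std) and corr is the correlation matrix R."""
    k = len(cond_std)
    return [[cond_std[i] * corr[i][j] * cond_std[j] for j in range(k)]
            for i in range(k)]

def implied_correlation(cov):
    """Correlation between the first two components implied by a covariance matrix."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])

# Made-up values: one time invariant correlation matrix, two volatility regimes.
R = [[1.0, 0.8], [0.8, 1.0]]
sigma_calm = ccc_covariance([0.01, 0.02], R)
sigma_turbulent = ccc_covariance([0.03, 0.05], R)

print(implied_correlation(sigma_calm), implied_correlation(sigma_turbulent))
```

In a DCC-type model, by contrast, the matrix passed as `corr` would itself change from period to period.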

In financial markets, it has been observed frequently that positive and negative shocks or news have quite different effects (Black (1976)). This so-called leverage effect can be introduced in different ways in MGARCH models. For example, Hafner & Herwartz (1998b) and Herwartz & Lütkepohl (2000) generalized a univariate proposal by Glosten, Jagannathan & Runkle (1993) and replaced

$$\Gamma_{11}^{*\prime} u_{t-1} u_{t-1}' \Gamma_{11}^* \quad\text{by}\quad \Gamma_{11}^{*\prime} u_{t-1} u_{t-1}' \Gamma_{11}^* + \Gamma_{-}^{*\prime} u_{t-1} u_{t-1}' \Gamma_{-}^* \, I\Big(\sum_{k=1}^{K} u_{k,t-1} < 0\Big) \tag{16.3.13}$$

in a BEKK model with $N = 1$. Here $I(\cdot)$ denotes an indicator function which takes the value 1 if its argument holds and 0 otherwise and $\Gamma_{-}^*$ is an additional $(K \times K)$ coefficient matrix. Another approach to allow for asymmetry is to use the so-called exponential GARCH (EGARCH) model proposed by Nelson (1991). A multivariate version was considered by Braun, Nelson & Sunier (1995).

A  range  of  other  models  was  also  proposed  and  the  literature  on  MGARCH 
models  has  grown  rapidly  over  the  last  years.  A  recent  survey  was  provided  by 
Bauwens  et  al.  (2004),  where  more  information  on  the  aforementioned  models, 
further  proposals  and  references  can  be  found. 

16.4  Estimation

16.4.1  Theory 

Using Bayes' theorem, the joint density function of $u_1, \dots, u_T$ is $f(u_1, \dots, u_T) = f(u_1) f(u_2|u_1) \cdots f(u_T|u_{T-1}, \dots, u_1)$. Thus, if in (16.3.1) $\varepsilon_t \sim$ i.i.d. $\mathcal{N}(0, I_K)$ so that the conditional distribution of $u_t$ given $\Omega_{t-1}$ is Gaussian and if the $u_t$ are observed quantities, the log-likelihood function of the general GARCH model described by (16.3.5), for a sample $u_1, \dots, u_T$, is given by

$$\ln l(\delta) = \sum_{t=1}^{T} \ln l_t(\delta), \tag{16.4.1}$$

where $\delta = \mathrm{vec}(\gamma_0, \Gamma_1, \dots, \Gamma_q, G_1, \dots, G_m)$ is the vector of unknown parameters and

$$\ln l_t(\delta) = -\frac{K}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_{t|t-1}| - \frac{1}{2} u_t' \Sigma_{t|t-1}^{-1} u_t, \qquad t = 1, \dots, T, \tag{16.4.2}$$

where the required initial values for specifying $\Sigma_{t|t-1}$ are assumed to be available. Similarly, the log-likelihood may be set up for special cases such as diagonal or BEKK models.
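
To illustrate how a single term of the Gaussian log-likelihood is evaluated, the following Python sketch computes $\ln l_t(\delta)$ for the bivariate case ($K = 2$) from a given residual vector and a given conditional covariance matrix. It is a minimal sketch under the assumption that the sequence of $\Sigma_{t|t-1}$ matrices has already been generated from some MGARCH recursion; the function names are hypothetical and the closed-form $2 \times 2$ inverse is used only to keep the example free of external libraries.

```python
import math

def gaussian_loglik_term(u, sigma):
    """ln l_t for K = 2: -(K/2) ln(2 pi) - 0.5 ln|Sigma| - 0.5 u' Sigma^{-1} u."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("conditional covariance matrix must be positive definite")
    # Quadratic form u' Sigma^{-1} u via the closed-form inverse of a 2 x 2 matrix.
    quad = (d * u[0] ** 2 - (b + c) * u[0] * u[1] + a * u[1] ** 2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def gaussian_loglik(residuals, sigmas):
    """Sum of the individual terms over t = 1, ..., T."""
    return sum(gaussian_loglik_term(u, s) for u, s in zip(residuals, sigmas))

# Made-up inputs: two observations with identity conditional covariance matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(gaussian_loglik([[0.0, 0.0], [1.0, 1.0]], [identity, identity]))
```

In an actual estimation routine, a numerical optimizer would call such a function repeatedly while varying the parameters that generate the covariance matrices.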

The likelihood function may be maximized with respect to the parameters $\delta$ by using numerical methods. A closed form solution does not exist because of the nonlinearity of the function. For uniqueness of the maximum and, hence, the existence of a unique ML estimator, it is important that an identified, unique parameterization is used, e.g., the BEKK form of the model with the restrictions discussed in Section 16.3.2. Of course, if the log-likelihood function (16.4.1)/(16.4.2) is used although the true distribution of the $\varepsilon_t$ is nonnormal, the resulting estimators will just be quasi ML estimators. Comte & Lieberman (2003) showed that quasi ML estimators have the following properties.

Proposition 16.1 (Properties of Quasi ML Estimators of GARCH Models)
Let $u_t$ be a BEKK GARCH process satisfying the following conditions:

(a) The parameter space is compact and identification restrictions are imposed.

(b) The eigenvalues of the matrix (16.3.6) have modulus less than one.

(c) $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{Kt})' \sim$ i.i.d. $(0, I_K)$ with $\varepsilon_{it}, \varepsilon_{jt}$ independent for $i \neq j$ ($i, j = 1, \dots, K$) and such that $u_t$ admits moments of at least order 8. Moreover, the $\varepsilon_t$ are continuous random variables with a density which is positive in a neighborhood of the origin.

(d) The initial values $u_t$, $t \leq 0$, are such that the process $u_t$ is strictly stationary.

Then the quasi ML estimator $\tilde{\delta}$ of $\delta$ obtained by maximizing the Gaussian likelihood function exists and is strongly consistent,

$$\tilde{\delta} \xrightarrow{a.s.} \delta.$$

Moreover, $\tilde{\delta}$ has an asymptotic normal distribution,

$$\sqrt{T}(\tilde{\delta} - \delta) \xrightarrow{d} \mathcal{N}(0, C_1^{-1} C_0 C_1^{-1}), \tag{16.4.3}$$

where

$$C_1 = -E\left[\frac{\partial^2 \ln l_t(\delta)}{\partial\delta\,\partial\delta'}\right] \quad\text{and}\quad C_0 = E\left[\frac{\partial \ln l_t(\delta)}{\partial\delta}\,\frac{\partial \ln l_t(\delta)}{\partial\delta'}\right]. \tag{16.4.4}$$


A  number  of  comments  are  worth  making  regarding  this  proposition. 

Remark 1  It can be shown that $C_0 = C_1$ if $\varepsilon_t$ is normally distributed. Hence, in this case, the asymptotic distribution in (16.4.3) becomes $\mathcal{N}(0, C_1^{-1})$, that is, the covariance matrix is the inverse asymptotic information matrix. ■

Remark 2  The condition of a compact parameter space is typical for nonlinear estimation problems. Although not totally satisfactory, it is not regarded as very problematic because the compact subset of the Euclidean space to which it refers may be so large that the condition is not really restrictive. The assumption regarding the initial values is also not restrictive if the stationarity of the process is accepted. It can be replaced by the assumption that the initial values are fixed, nonstochastic values. ■

Remark 3  In contrast, the assumptions regarding the $\varepsilon_t$ are not fully satisfactory. In particular, the requirement that moments of order 8 have to exist for $u_t$ is undesirable for financial time series where the existence of higher order moments is regarded as problematic. On the other hand, the theorem improves on previously available results, which shows how difficult it is to derive asymptotic properties of the estimators of MGARCH processes. A number
of  other  authors  have  derived  more  specialized  results,  notably  for  univariate 
processes  (see  the  review  articles  mentioned  at  the  end  of  Section  16.1).  For 
multivariate  GARCH  processes,  consistency  of  the  quasi  ML  estimators  was 
shown  by  Jeantheau  (1998)  under  the  main  assumption  of  a  strictly  stationary 
and  ergodic  process.  Ling  &  McAleer  (2003)  derived  asymptotic  normality  of 
quasi  ML  estimators  under  less  restrictive  moment  assumptions  for  a  VARMA 
process  with  CCC  GARCH  residuals.  ■ 

Typically, the $u_t$ are residuals of some dynamic model. Suppose they are the errors of a VAR(p) process, possibly with integrated or cointegrated variables. Thus, we have a model of the form

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t.$$

In this case, the VAR parameters have to be estimated in addition to the coefficients associated with the $u_t$ process. Setting up the corresponding Gaussian likelihood or quasi likelihood function is not difficult. However, the optimization may be a formidable task. Assuming that the numerical problems can be solved, there is some hope that the asymptotics can also be resolved because, under quite general conditions, the asymptotic information matrix of the VAR parameters and the GARCH parameters is block diagonal so that the estimators of the VAR coefficients are asymptotically independent of the GARCH parameter estimators. This result also suggests a two step estimation procedure in which the VAR coefficients $\nu, A_1, \dots, A_p$ are estimated by LS or, if restrictions are imposed on the parameters, by EGLS and then a GARCH model is fitted to the residuals of the first stage estimation.

Given that normality of the conditional distribution of the $u_t$ is often difficult to justify, in particular, in financial applications, it may also be worth pointing out that ML estimation with other distributions has been studied. The survey by Bauwens et al. (2004) provides further information and references on these issues as well as computational aspects of ML and quasi ML estimation.


16.4.2  An  Example 

Two series of daily stock returns (first differences of ln prices) will be used to illustrate the previous theoretical considerations. In particular, returns of VW (Volkswagen) common stock ($y_{1t}$) and preference stock ($y_{2t}$) for the period January 1987-December 1992 (1579 observations) are used.¹ The two series are plotted in Figure 16.3. The corresponding logarithms of the price series are both strongly related to the performance of the VW company and, hence, they are likely to be related to each other. Therefore, it makes sense to analyze the stocks as a bivariate series. The ln price series were previously analyzed by Herwartz & Lütkepohl (2000). In contrast to these authors, we consider the bivariate series $y_t = (y_{1t}, y_{2t})'$ of returns. Although the ln prices may be cointegrated, a preliminary analysis has shown that there is weak evidence of cointegration at best. Therefore, it seems justified to focus on the returns in the following.

The  two  series  of  stock  returns  display  some  changes  in  their  volatility 
and  there  are  also  some  unusually  large  (in  absolute  value)  observations.  Such 
values  are  often  classified  as  outliers.  Thus,  based  on  the  graphs  in  Figure 
16.3,  one  may  not  expect  the  series  to  be  generated  by  a  Gaussian  process 
and  ARCH  or  GARCH  models  may  be  used  to  capture  the  volatility  dynamics. 

When VAR(p) models of increasing order are fitted to the bivariate series $y_t$, it turns out that AIC and HQ recommend an order of $p = 3$ while SC suggests $p = 0$. Therefore, the residuals of the following estimated VAR(3) model (with t-values in parentheses) will be used in the following bivariate GARCH analysis:

1  The  price  series  are  from  Deutsche  Finanzdatenbank  Karlsruhe. 

Fig. 16.3. Daily returns of VW common ($y_1$) and preference stock ($y_2$).

(16.4.5)


The residual series are plotted in Figure 16.4. They still show volatility clusters and outliers. Hence, there may be conditional heteroskedasticity in the residuals of model (16.4.5). In that case, it may not be a good strategy to choose the VAR order first by one of our standard model selection criteria, as we have done here. Alternatively, it may be preferable to derive criteria that allow a simultaneous determination of the joint model for the conditional first and second moments (see Brooks & Burke (2003)). We will nevertheless use the residuals from the model (16.4.5) in the subsequent analysis for illustrative purposes.

Based on the residuals of the model (16.4.5), the following BEKK GARCH(1,1) model was estimated (with t-values in parentheses):

(16.4.6)


Note that the uniqueness conditions mentioned in Section 16.3.2 are satisfied here. It is not clear, however, that in the present situation the t-ratios have standard normal limiting distributions because the assumptions of Proposition 16.1 are violated. In particular, we are working with residuals from a previously fitted model rather than with original observations. Still, the sizes of the t-ratios underneath the coefficient estimates indicate some interaction in the conditional second moments.

It would be helpful to have tools for checking the model quality and for analyzing the relationships summarized in the model. For model checking, the estimated $\tilde{\varepsilon}_t = \tilde{\Sigma}_{t|t-1}^{-1/2} \hat{u}_t$ can be used. Standardized estimated $\varepsilon_t$ (divided by their estimated standard deviations) are plotted in Figure 16.5. Volatility clusters are not quite as obvious anymore as in Figure 16.4. On the other hand, outliers are still present, which casts doubt on the normality of the $\varepsilon_t$. Some tests for model adequacy will be discussed in the next section.

Fig. 16.4. Residual series of model (16.4.5).

Fig. 16.5. Standardized residuals of model (16.4.6) ($\tilde{\varepsilon}_{1t}$ upper panel, $\tilde{\varepsilon}_{2t}$ lower panel).


16.5  Checking  MGARCH  Models 

16.5.1  ARCH-LM and ARCH-Portmanteau Tests

Before  an  MGARCH  model  is  fitted  to  the  residuals  of  a  VAR  or  VECM,  one 
may  want  to  check  if  ARCH  effects  are  present  in  the  residuals.  An  LM  test 
is  a  standard  tool  for  this  purpose  (e.g.,  Doornik  &  Hendry  (1997)).  The  idea 
is  to  consider  the  auxiliary  model 

$$\mathrm{vech}(u_t u_t') = \beta_0 + B_1 \mathrm{vech}(u_{t-1} u_{t-1}') + \cdots + B_q \mathrm{vech}(u_{t-q} u_{t-q}') + \mathit{error}_t, \tag{16.5.1}$$

where $\beta_0$ is $\frac{1}{2}K(K+1)$-dimensional and the $B_j$'s are $(\frac{1}{2}K(K+1) \times \frac{1}{2}K(K+1))$ coefficient matrices ($j = 1, \dots, q$). If all the $B_j$ matrices are zero, there is no ARCH in the residuals. Therefore, the pair of hypotheses

$$H_0: B_1 = \cdots = B_q = 0 \quad\text{versus}\quad H_1: B_1 \neq 0 \text{ or } \cdots \text{ or } B_q \neq 0 \tag{16.5.2}$$

is checked. It turns out that the corresponding LM statistic can be determined by replacing all unknown $u_t$'s in (16.5.1) by estimated residuals from a VAR or VECM, say, and estimating the parameters in the resulting auxiliary model by LS. Denoting the resulting residual covariance matrix estimator based on (16.5.1) by $\hat{\Sigma}_{vech}$ and the corresponding matrix obtained for $q = 0$ by $\hat{\Sigma}_0$, the relevant LM statistic can be shown to be of the form

$$LM_{MARCH}(q) = \tfrac{1}{2} T K(K+1) - T\,\mathrm{tr}(\hat{\Sigma}_{vech} \hat{\Sigma}_0^{-1}). \tag{16.5.3}$$

Under the null hypothesis, the statistic has an asymptotic $\chi^2(qK^2(K+1)^2/4)$-distribution, if $u_t$ satisfies standard conditions (see Doornik & Hendry (1997, Sec. 10.9.2.4)).
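
The bookkeeping behind the statistic and its degrees of freedom can be sketched in a few lines of Python. The fragment below is only an illustration with hypothetical function names: it assumes that the trace term $\mathrm{tr}(\hat{\Sigma}_{vech}\hat{\Sigma}_0^{-1})$ has already been obtained from the two auxiliary LS regressions and does not perform those regressions itself.

```python
def arch_lm_degrees_of_freedom(K, q):
    """Degrees of freedom q * K^2 * (K + 1)^2 / 4 of the asymptotic chi-square law."""
    n = K * (K + 1) // 2          # number of elements of vech(u_t u_t')
    return q * n * n              # q coefficient matrices B_j, each n x n

def arch_lm_statistic(T, K, trace_term):
    """LM_MARCH(q) = 0.5 * T * K * (K + 1) - T * tr(Sigma_vech Sigma_0^{-1})."""
    return 0.5 * T * K * (K + 1) - T * trace_term

# Bivariate system with q = 1 and q = 4, and a univariate series with q = 4.
print(arch_lm_degrees_of_freedom(2, 1), arch_lm_degrees_of_freedom(2, 4),
      arch_lm_degrees_of_freedom(1, 4))
```

If the lagged terms explain nothing, the two covariance estimators coincide, the trace equals $\frac{1}{2}K(K+1)$, and the statistic is zero; the more the lags reduce the residual covariance, the larger the statistic becomes.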

In (16.5.1), each of the $B_j$ matrices is of dimension $(\frac{1}{2}K(K+1) \times \frac{1}{2}K(K+1))$ and, hence, the auxiliary model involves a large number of parameters even if the order $q$ and the dimension of the process $K$ are only moderate. Therefore the test is not suitable to check for large $q$, unless the sample size is very large too. It is possible, however, to apply the test to each of the $K$ residual series individually.

From the VARMA representation of an MGARCH process in Section 16.3.2, it can be seen that there is no ARCH in the process $u_t$, if the process $x_t := \mathrm{vech}(u_t u_t')$ has no serial correlation. This observation suggests that we may apply a portmanteau test to $x_t$ to check for ARCH in $u_t$. Thus, one may use

$$Q_h^{ARCH} := T \sum_{i=1}^{h} \mathrm{tr}(C_i' C_0^{-1} C_i C_0^{-1}) \tag{16.5.4}$$

or the associated modified version

$$\bar{Q}_h^{ARCH} := T^2 \sum_{i=1}^{h} (T - i)^{-1} \mathrm{tr}(C_i' C_0^{-1} C_i C_0^{-1}), \tag{16.5.5}$$

where now $C_i = T^{-1} \sum_{t=i+1}^{T} (x_t - \bar{x})(x_{t-i} - \bar{x})'$ ($i = 0, 1, \dots, h$).

The asymptotic $\chi^2$-distributions of these tests follow from the results in Section 4.4.3, if $x_t$ is indeed white noise. In practice, it will usually be replaced by a quantity based on estimation residuals $\hat{u}_t$. A rigorous treatment of the properties of the statistics in that case seems to be still missing. In principle, the ARCH-portmanteau test can also be applied to the individual residual series.
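
For a single residual series, $x_t$ reduces to the squared residual $u_t^2$ and the trace terms in (16.5.4) and (16.5.5) reduce to squared autocorrelations of $x_t$. The following Python sketch computes both statistics for this univariate special case. The function name and the short input series are illustrative assumptions; the multivariate version would require matrix autocovariances instead of scalars.

```python
def arch_portmanteau_univariate(u, h):
    """Return (Q_h, modified Q_h) computed from x_t = u_t^2 for one residual series."""
    x = [v * v for v in u]
    T = len(x)
    xbar = sum(x) / T

    def autocov(i):
        # c_i = T^{-1} * sum_{t=i+1}^{T} (x_t - xbar)(x_{t-i} - xbar), 0-based indexing
        return sum((x[t] - xbar) * (x[t - i] - xbar) for t in range(i, T)) / T

    c0 = autocov(0)
    if c0 == 0:
        raise ValueError("squared residuals have zero sample variance")
    # Scalar case: tr(C_i' C_0^{-1} C_i C_0^{-1}) = (c_i / c_0)^2
    ratios = [(autocov(i) / c0) ** 2 for i in range(1, h + 1)]
    q = T * sum(ratios)
    q_modified = T * T * sum(r / (T - i) for i, r in enumerate(ratios, start=1))
    return q, q_modified

# Made-up residuals whose squares alternate in blocks of two.
q, q_mod = arch_portmanteau_univariate([1, -1, 2, -2, 1, -1, 2, -2], h=1)
print(q, q_mod)
```

The modified statistic weights each term by $T/(T-i)$ and is therefore never smaller than the unmodified one.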


16.5.2  LM  and  Portmanteau  Tests  for  Remaining  ARCH 

In  practice,  it  is  also  useful  to  check  for  remaining  ARCH  in  the  residuals  of  a 
fitted  ARCH  or  GARCH  model.  Such  tests  are  of  particular  importance  in  the 
present  context  because  low  order  multivariate  models  are  typically  fitted  as 
a  first  attempt  to  account  for  conditionally  heteroskedastic  residuals.  Higher 
order  models  often  have  an  excessive  number  of  parameters  and  the  estimates 
are  difficult  to  compute  numerically.  Therefore,  it  makes  sense  to  start  with 
low  order  models  and  increase  the  order  only  if  the  low  order  model  cannot 
capture  the  second  order  moment  dynamics  in  the  data  properly.  Hence,  tests 
for  remaining  ARCH  in  the  residuals  of  an  MGARCH  model  are  needed. 

Both the ARCH-LM and the ARCH-portmanteau tests have been used for this purpose. In that case, the $u_t$'s in the $x_t$ vectors are replaced by the estimated $\varepsilon_t$ from (16.3.1). In other words, $\tilde{\varepsilon}_t := \tilde{\Sigma}_{t|t-1}^{-1/2} \hat{u}_t$ is used instead of $\hat{u}_t$. Here ML estimators are signified by a tilde. Whereas the LM tests maintain their validity under general conditions in the present situation (Engle & Kroner (1995)), the same is not true for the portmanteau tests. They have nevertheless found widespread use in applied work (see Tse & Tsui (1999) for references and further discussion).

Again,  it  may  be  useful  to  apply  the  tests  not  only  to  the  multivariate 
residual  vectors  but  also  to  the  univariate  components  separately.  There  are 
also other tests for remaining ARCH which have a sounder theoretical basis than the portmanteau tests (see Bauwens et al. (2004) for a review and Lundbergh & Teräsvirta (2002) for a discussion of the univariate case).

16.5.3  Other Diagnostic Tests

Other  diagnostic  tools  for  checking  the  validity  of  fitted  MGARCH  models 
are  also  available.  In  fact,  some  of  the  residual  diagnostics  for  VAR  models 
are also applicable here. Instead of the $u_t$'s, the $\varepsilon_t$'s should now be used
as  the  basic  residuals.  For  example,  they  can  be  used  to  perform  tests  for 
nonnormality.  Although  the  necessary  extensions  are  in  many  cases  possible, 
some  care  is  needed  in  the  present  context.  It  can  by  no  means  be  taken  for 
granted  that  all  the  procedures  work  properly.  The  case  of  normality  tests  and 
related  caveats  when  they  are  applied  to  GARCH  residuals  was  discussed  by 
Fiorentini,  Sentana  &  Calzolari  (2004). 


16.5.4  An  Example 

As  an  example,  we  consider  again  the  VW  stock  returns.  In  Section  16.4.2, 
we  have  fitted  an  MGARCH(1, 1)  model  because  the  residuals  of  the  model 
(16.4.5)  appeared  to  have  volatility  clusters.  Now  we  can  use  ARCH-LM  tests 
and formally test for conditional heteroskedasticity of the $u_t$'s. Some results
are  presented  in  Table  16.1.  Both  bivariate  and  univariate  tests  applied  to  the 
individual  residual  series  clearly  reject  the  no-ARCH  null  hypothesis.  Thus, 
there  is  strong  evidence  in  favor  of  conditionally  heteroskedastic  residuals.  We 
do  not  present  results  of  the  ARCH-portmanteau  test  because  its  validity  is 
not  clear. 


Table 16.1. ARCH-LM tests for $\hat{u}_t$ residuals from (16.4.5)

                               bivariate                  $\hat{u}_{1t}$                   $\hat{u}_{2t}$
test                       LM(1)        LM(4)        LM(1)        LM(4)        LM(1)        LM(4)
test value                 147.4        245.5        54.9         62.1         56.1         70.8
asymptotic distribution    $\chi^2(9)$  $\chi^2(36)$ $\chi^2(1)$  $\chi^2(4)$  $\chi^2(1)$  $\chi^2(4)$
p-value                    0.00         0.00         0.00         0.00         0.00         0.00

Of course, the fact that there may be ARCH in the residuals does not necessarily mean that an MGARCH(1,1) process is a suitable model. Therefore, we also applied tests for remaining ARCH to the residuals $\tilde{\varepsilon}_t = \tilde{\Sigma}_{t|t-1}^{-1/2} \hat{u}_t$ based on (16.4.6). The test results are given in Table 16.2. Now none of the ARCH-LM tests rejects the null hypothesis at conventional significance levels. On the other hand, applying nonnormality tests confirms what could have been conjectured by looking at the residuals in Figure 16.5, namely that, due to the outliers, normality of the conditional distribution is not likely to be a reasonable assumption. Thus, it may be worth trying some other distribution for the $\varepsilon_t$ or some other model than the standard BEKK GARCH(1,1) we have presented in Section 16.4.2.

Table 16.2. ARCH-LM and nonnormality ($\lambda_{sk}$) tests for $\tilde{\varepsilon}_t$ residuals associated with (16.4.6)

                        bivariate                                $\tilde{\varepsilon}_{1t}$                               $\tilde{\varepsilon}_{2t}$
test             LM(1)        LM(4)         $\lambda_{sk}$   LM(1)        LM(4)        $\lambda_{sk}$   LM(1)        LM(4)        $\lambda_{sk}$
test value       10.43        33.48         12770            0.067        0.364        4440             0.099        0.385        10314
asymp. distr.    $\chi^2(9)$  $\chi^2(36)$  $\chi^2(4)$      $\chi^2(1)$  $\chi^2(4)$  $\chi^2(2)$      $\chi^2(1)$  $\chi^2(4)$  $\chi^2(2)$
p-value          0.32         0.59          0.00             0.80         0.99         0.00             0.75         0.98         0.00


16.6  Interpreting  GARCH  Models 

16.6.1  Causality  in  Variance 

As we have seen in Chapter 2, Section 2.3.1, Granger's definition of causality is based on forecasts. We have also seen in Chapter 2 that, under suitable conditions, optimal forecasts are obtained as conditional expectations. Therefore, Granger-causality may be defined in terms of conditional expectations. In other words, we may define a time series variable $x_t$ to be causal for $z_t$, if

$$E(z_{t+1}|z_t, z_{t-1}, \dots) \neq E(z_{t+1}|z_t, z_{t-1}, \dots, x_t, x_{t-1}, \dots). \tag{16.6.1}$$

This definition suggests a direct extension to higher order conditional moments. We define $x_t$ to be causal for $z_t$ in $r$-th moment if

$$E(z_{t+1}^r|z_t, z_{t-1}, \dots) \neq E(z_{t+1}^r|z_t, z_{t-1}, \dots, x_t, x_{t-1}, \dots). \tag{16.6.2}$$

Thus, (16.6.1) defines causality in mean and considering the central second moments in (16.6.2) gives a definition of causality in variance which is analogous to the previous definition of Granger-causality. The interpretation is also analogous to that of Granger-causality in mean. In other words, if $x_t$ is causal-in-variance for $z_t$, the conditional volatility of $z_t$ can be predicted more precisely by taking into account present and past information in $x_t$ than without taking this information into account.

If  the  conditional  covariance  structure  can  be  described  by  multivariate 
ARCH or MGARCH models, the restrictions implied by these definitions are
also  similar  to  those  for  Granger-causality  in  VAR  and  VARMA  models  (see 
Comte  &  Lieberman  (2000)).  In  other  words,  they  can  be  described  in  terms 
of  zero  restrictions  on  the  ARCH  or  MGARCH  parameters.  Depending  on 
the  specific  parameterization  of  the  MGARCH  model,  the  restrictions  can  be 
nonlinear  in  the  present  situation,  however.  Tests  for  causality  in  variance 
were  proposed  and  investigated  by  Cheung  &  Ng  (1996),  Hong  (2001),  and 
Pantelidis  &  Pittis  (2004). 

It is also possible to generalize the causality definition and specify, for example, conditions for both the conditional first and second order moments (e.g., Granger, Robins & Engle (1986)). More generally, one may consider the full conditional distributions rather than just specific moments. In other words, one may define $x_t$ to be causal for $z_t$ if

$$F_{z_{t+1}|z_t, z_{t-1}, \dots}(\cdot) \neq F_{z_{t+1}|z_t, z_{t-1}, \dots, x_t, x_{t-1}, \dots}(\cdot), \tag{16.6.3}$$

where $F_{z|x}(\cdot)$ denotes the conditional distribution function of $z$ given $x$. Generalizing these concepts to the case where $x_t$ and $z_t$ are vectors is theoretically straightforward, as in the case of Granger-causality in mean.

16.6.2  Conditional  Moment  Profiles  and  Generalized  Impulse 
Responses 

Impulse  responses  were  used  among  other  tools  for  analyzing  the  relations 
between the variables of linear models such as VARs and VECMs. In linear models, they have the advantage of being time invariant and their shape
is  invariant  to  the  size  and  direction  of  the  impulses.  These  features  enable 
the  analyst  to  represent  the  reactions  of  the  variables  to  impulses  hitting  the 
system in a small set of graphs. GARCH models are nonlinear models, however. In such models, the situation is quite different. In general, in a nonlinear model, the marginal effect of an impulse will depend on the state of the system when the impulse arrives. Thus, it depends on the history of the variables
and  it  may  be  different  in  each  time  point  during  the  sample.  Moreover,  the 
shape  of  the  impulse  responses  will  generally  depend  on  the  size  and  direction 
of  the  impulse.  For  example,  quite  different  reactions  may  be  obtained  from 
positive  and  negative  impulses.  In  a  linear  model,  a  negative  impulse  of  one 
unit induces the same responses of the variables with opposite sign as a positive impulse of one unit. In contrast, in a nonlinear model, a positive impulse
may,  e.g.,  induce  almost  no  reaction  of  the  variables  whereas  a  corresponding 
negative  impulse  hitting  the  system  at  the  same  state  may  lead  to  a  strong 
reaction.  These  features  are  quite  plausible  in  some  systems.  For  example, 
if  the  impulses  represent  news  arriving  in  a  financial  market,  positive  news 
may have a quite different effect than negative news. Hence, nonlinear models clearly have their attractive features for describing economic systems or
phenomena. 

Still, the greater flexibility of nonlinear models makes them more difficult to interpret properly. In fact, it is not obvious how to define impulse responses of nonlinear models in a meaningful manner. Gallant, Rossi & Tauchen (1993) proposed so-called conditional moment profiles which may give useful information on important features and implications of nonlinear multiple time series models. In the spirit of their definition, we consider quantities of the general form

$$E[g(y_{t+h})|y_t + \xi, \Omega_{t-1}] - E[g(y_{t+h})|y_t, \Omega_{t-1}], \qquad h = 1, 2, \dots, \tag{16.6.4}$$

where $g(\cdot)$ denotes some function of interest, $\xi$ represents the impulses hitting the system at time $t$, and $\Omega_{t-1} = (y_{t-1}, y_{t-2}, \dots)$ denotes the history of the variables at time $t$. In other words, the conditional expectation of some quantity of interest, given the history of $y_t$ in period $t$, is compared to the conditional expectation that is obtained if a shock $\xi$ occurs at time $t$. For example, defining

$$g(y_{t+h}) = [y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})][y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})]' \tag{16.6.5}$$

results in conditional volatility profiles, which may be compared to a baseline profile obtained for a specific history of the process and a zero impulse. Clearly, in general the conditional moment profiles depend on the history $\Omega_{t-1}$ as well as the impulse $\xi$. Similar quantities were also considered by Koop, Pesaran & Potter (1996) who called them generalized impulse responses.

If models with ARCH or MGARCH errors are used to describe the volatility dynamics of a financial market, the volatility resulting from the arrival of news may be of interest (see Engle & Ng (1993)). In this case, using the function (16.6.5) and comparing conditional covariance matrices

$$\Sigma_{t+h|t} = E\{[y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})][y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})]' \,|\, y_t, \Omega_{t-1}\},$$

based on the actual history at time $t$, to

$$\Sigma_{t+h|t}^{\xi} = E\{[y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})][y_{t+h} - E(y_{t+h}|\Omega_{t+h-1})]' \,|\, y_t + \xi, \Omega_{t-1}\},$$

for $h = 1, 2, \dots$, may give an impression of the reactions of the market under consideration. For instance, for the BEKK GARCH(1,1) model, we get

$$\Sigma_{t+1|t} = C_0^{*\prime} C_0^* + \Gamma_{11}^{*\prime} E(u_t u_t'|y_t, \Omega_{t-1}) \Gamma_{11}^* + G_{11}^{*\prime} \Sigma_{t|t-1} G_{11}^*. \tag{16.6.6}$$

The quantities in (16.6.6) are usually computed using the estimates of the conditional mean equation and the relevant GARCH volatility model. The matrix $E(u_t u_t'|y_t, \Omega_{t-1}) = u_t u_t'$ is replaced by $\hat{u}_t \hat{u}_t'$, where the $\hat{u}_t$ are typically residuals from estimating the conditional mean model. If the corresponding quantities related to an impulse $\xi$ are considered, the impulse is simply added to the $\hat{u}_t$. Because $\Sigma_{t+h|t}$, $h = 2, 3, \dots$, is a convenient estimator for $E(u_{t+h} u_{t+h}'|y_t, \Omega_{t-1})$, recursive forecasts of future volatility, conditional on information which is available at time $t$, are computed as:

$$\Sigma_{t+h|t} = C_0^{*\prime} C_0^* + \Gamma_{11}^{*\prime} \Sigma_{t+h-1|t} \Gamma_{11}^* + G_{11}^{*\prime} \Sigma_{t+h-1|t} G_{11}^*, \qquad h = 2, 3, \dots. \tag{16.6.7}$$

From the conditional covariance matrices, conditional moment profiles are obtained as differences

$$\begin{bmatrix} \phi_{11t,h}(\xi) & \cdots & \phi_{1Kt,h}(\xi) \\ \vdots & \ddots & \vdots \\ \phi_{1Kt,h}(\xi) & \cdots & \phi_{KKt,h}(\xi) \end{bmatrix} = \Sigma_{t+h|t}^{\xi} - \Sigma_{t+h|t}. \tag{16.6.8}$$
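
The computation of a conditional volatility profile can be sketched as follows for a BEKK GARCH(1,1) model: forecast the conditional covariance matrices once from the actual residual and once from the residual with the impulse added, and take differences. The Python code below is a minimal sketch under stated assumptions. All function names and parameter values are hypothetical, the matrices are supplied directly rather than estimated, and plain nested lists are used instead of a matrix library.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(*mats):
    """Elementwise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def sandwich(A, M):
    """Return A' M A."""
    return matmul(transpose(A), matmul(M, A))

def outer(u):
    return [[a * b for b in u] for a in u]

def bekk_forecasts(C0, Gamma, G, u_t, sigma_t, horizon):
    """Conditional covariance forecasts for h = 1, ..., horizon."""
    const = matmul(transpose(C0), C0)
    # h = 1: u_t u_t' is known at time t
    forecasts = [add(const, sandwich(Gamma, outer(u_t)), sandwich(G, sigma_t))]
    # h >= 2: the previous forecast replaces both u u' and the lagged covariance
    for _ in range(2, horizon + 1):
        prev = forecasts[-1]
        forecasts.append(add(const, sandwich(Gamma, prev), sandwich(G, prev)))
    return forecasts

def volatility_profile(C0, Gamma, G, u_t, sigma_t, xi, horizon):
    """Differences between forecasts with and without the impulse xi added to u_t."""
    base = bekk_forecasts(C0, Gamma, G, u_t, sigma_t, horizon)
    shocked_u = [u + s for u, s in zip(u_t, xi)]
    shocked = bekk_forecasts(C0, Gamma, G, shocked_u, sigma_t, horizon)
    return [[[s - b for s, b in zip(srow, brow)] for srow, brow in zip(S, B)]
            for S, B in zip(shocked, base)]

# Made-up parameter values for a bivariate system.
C0 = [[0.1, 0.0], [0.0, 0.1]]
Gamma = [[0.3, 0.0], [0.0, 0.3]]
G = [[0.9, 0.0], [0.0, 0.9]]
sigma_t = [[1.0, 0.0], [0.0, 1.0]]
profile = volatility_profile(C0, Gamma, G, [0.0, 0.0], sigma_t, [1.0, 0.0], horizon=3)
print(profile[0][0][0], profile[1][0][0], profile[2][0][0])
```

With these diagonal illustrative matrices, the effect of the impulse on the first conditional variance decays geometrically with the horizon, while a zero impulse yields a profile that is identically zero.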




Although  these  quantities  may  be  interesting  to  look  at,  they  depend  on 
t,  h,  and  £.  Hence,  there  is  a  separate  impulse  response  function  for  each 
given  t  and  £.  In  empirical  work,  it  will  therefore  be  necessary  to  summarize 
the  wealth  of  information  in  the  conditional  moment  profiles  in  a  meaningful 
way.  In  a  study  of  two  stock  price  series,  Herwartz  &  Liitkepohl  (2000),  for 
example,  considered  the  following  summary  statistics: 

•  Averages  over  all  histories  for  different  impulse  vectors  £,  4>ij..h(£,)  = 

r-'ELW  0- 

•  Averages  over  a  large  range  of  different  impulse  vectors  £r,  (f>ijt,h( ■)  = 

R-1  J2r^i  for  given  values  of  t  and  h.  Here  R  is  the  number  of 

impulses  considered.  The  impulse  may,  for  instance,  be  obtained  from 
the  estimated  model  residuals. 

Although these summary statistics condense the information in the conditional
moment profiles considerably, they are still a rich source of information
which can be presented in graphs or further condensed by fitting nonparametric
density functions or using other summary statistics (see Herwartz &
Lütkepohl (2000)).
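
To make the recursion (16.6.7) and the profiles (16.6.8) more concrete, the following Python sketch treats the scalar case K = 1 only, in which a BEKK GARCH(1,1) model reduces to $\sigma^2_{t|t-1} = c_0^2 + \gamma^2 u_{t-1}^2 + g^2 \sigma^2_{t-1|t-2}$. The function names and all parameter values are hypothetical and chosen purely for illustration; they are not taken from Herwartz & Lütkepohl (2000).

```python
def volatility_forecasts(u_t, sigma2_t, c0, gam, g, horizon):
    """Forecasts sigma2_{t+1|t}, ..., sigma2_{t+horizon|t} for a scalar BEKK(1,1)."""
    # One step ahead, the innovation u_t at the forecast origin is known.
    path = [c0**2 + gam**2 * u_t**2 + g**2 * sigma2_t]
    # For h >= 2 the unknown squared innovation is replaced by its forecast,
    # which is the scalar version of the recursion (16.6.7).
    for _ in range(2, horizon + 1):
        path.append(c0**2 + (gam**2 + g**2) * path[-1])
    return path


def moment_profile(u_t, xi, sigma2_t, c0, gam, g, horizon):
    """Forecasts with the impulse xi added to u_t minus the baseline forecasts."""
    shocked = volatility_forecasts(u_t + xi, sigma2_t, c0, gam, g, horizon)
    baseline = volatility_forecasts(u_t, sigma2_t, c0, gam, g, horizon)
    return [a - b for a, b in zip(shocked, baseline)]


def average_over_histories(residuals, sigma2_path, xi, c0, gam, g, horizon):
    """Average the profiles over all histories t for one fixed impulse xi."""
    profiles = [moment_profile(u, xi, s2, c0, gam, g, horizon)
                for u, s2 in zip(residuals, sigma2_path)]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


# Illustrative (made-up) values: profile for horizons h = 1, 2, 3.
profile = moment_profile(u_t=0.5, xi=1.0, sigma2_t=0.2,
                         c0=0.1, gam=0.3, g=0.9, horizon=3)
```

In this scalar setting the profile decays geometrically at rate $\gamma^2 + g^2$, and its starting value depends on the history through $u_t$, which is why averaging over histories or over impulses is needed to summarize it.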

Of course, in practice, an additional obstacle is that the actual data generation
process is unknown and estimated models are available at best. In that
case,  the  conditional  moment  profiles  or  generalized  impulse  responses  will  be 
computed  from  estimated  quantities  only.  They  are  therefore  also  estimates 
and  it  would  be  useful  to  have  measures  for  their  sampling  variability.  It  is 
not  clear  how  this  additional  information  is  computed  and  presented  in  the 
best  way  in  practice.  In  any  case,  if  only  the  estimated  quantities  are  available 
and  presented,  it  is  useful  to  keep  in  mind  these  further  limitations  when  the 
results  are  interpreted. 

It  is  naturally  of  interest  to  better  understand  what  the  various  models  for 
conditional  volatility  can  tell  us  about  the  relations  between  variables  and, 
hence,  about  what  is  actually  going  on  in  a  particular  market  or  segment  of  the 
economy.  Therefore  it  is  not  surprising  that  the  interpretation  of  MGARCH 
models  is  a  field  of  active  research.  Some  important  recent  contributions  in 
addition  to  those  noted  earlier  are  Engle  &  Ng  (1993),  Lin  (1997),  and  Hafner 
&  Herwartz  (1998a). 


16.7  Problems  and  Extensions 

There are a number of problems associated with ARCH and GARCH modelling.
Some of them have been mentioned in earlier sections of this chapter
but  may  be  worth  emphasizing  again.  In  addition,  there  are  some  problems 
which  we  have  not  addressed  so  far. 

First,  due  to  the  highly  nonlinear  form  of  the  log-likelihood  function  and 
the  potentially  large  number  of  parameters  in  a  multivariate  GARCH  model 
which have to satisfy a number of restrictions, computing ML estimates is
a difficult task. Therefore it is highly desirable to develop fast and robust
optimization  algorithms  which  work  well  under  these  particular  conditions. 
A  review  and  comparison  of  some  of  the  available  software  was  provided  by 
Brooks,  Burke  &  Persand  (2003). 

Secondly, a sound analysis of conditional heteroskedasticity in a multivariate
time series context requires that at least the asymptotic properties
of the estimators are known. As we have seen in Section 16.4.1, some asymptotic
theory is available for quasi ML estimators of specific MGARCH models.
Unfortunately, the required conditions are not satisfactory in all situations.
Hence, developing asymptotic theory under more general conditions is desirable.

Third,  a  toolkit  for  model  specification  and  model  checking  is  available, 
as  we  have  seen  in  Section  16.5.  There  are  some  open  questions  regarding  the 
statistical  properties  of  these  tools,  however.  Moreover,  given  the  wealth  of 
possible  model  specifications,  some  more  refined  tools  are  desirable  that  help 
the  analyst  to  find  the  best  specification  for  a  particular  data  set  and  analysis 
objective  and  for  discriminating  between  alternative  models. 

Fourth,  although  a  range  of  proposals  have  been  made  on  how  to  interpret 
multivariate  GARCH  models,  the  available  tools  leave  room  for  improvements. 
The  nonlinearity  of  these  models  makes  it  more  difficult  to  extract  the  essential 
features  than  in  linear  models  for  the  conditional  mean. 

Finally, there are many features in financial and other economic data
which are not described well by the GARCH models considered in this chapter.
Therefore, a range of other models have been proposed that can capture
specific aspects of the distributional properties of financial series in a more
satisfactory way. For example, exogenous variables may be included in a multivariate
GARCH model (see Engle & Kroner (1995)). Also, as mentioned in
Section 16.1, the volatility in a series may have an impact on the conditional
mean. To account for this possibility, it may be useful to allow conditional variances
to enter the conditional mean function (Engle, Lilien & Robins (1987)).
These so-called ARCH-in-mean (ARCH-M) models may also be generalized
to the multivariate case.

Stochastic volatility models represent another approach to modelling time-varying
volatility. In this approach, the conditional covariance matrix depends
on an unobserved latent process and not on past observations as in the ARCH
model. For instance, in the univariate case, letting $\varepsilon_t \sim$ i.i.d. $\mathcal{N}(0, 1)$ and
specifying $u_t := \sigma_t \varepsilon_t$, the logarithm of the conditional standard deviation is
assumed to be generated as

\[ \ln \sigma_t = \varphi \ln \sigma_{t-1} + \eta \kappa_t, \]

where $\kappa_t \sim$ i.i.d. $\mathcal{N}(0, 1)$ and $\varphi$ and $\eta$ are constant parameters. A survey
of multivariate stochastic volatility models was given by Ghysels, Harvey &
Renault (1996). It may also be worth noting that, in some sense, random coefficient
autoregressive models may be regarded as extensions of multivariate
ARCH models (see Wong & Li (1997)).
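
The univariate stochastic volatility specification above is easy to simulate. The following Python sketch is only an illustration under assumed parameter values; the function name `simulate_sv`, the starting value `ln_sigma0`, and the seed handling are our own choices and not part of the model description.

```python
import math
import random


def simulate_sv(n, phi, eta, seed=0, ln_sigma0=0.0):
    """Simulate u_t = sigma_t * eps_t with ln sigma_t = phi * ln sigma_{t-1} + eta * kappa_t."""
    rng = random.Random(seed)
    ln_sigma = ln_sigma0          # assumed starting value of the latent process
    sigmas, us = [], []
    for _ in range(n):
        kappa = rng.gauss(0.0, 1.0)            # latent volatility shock
        ln_sigma = phi * ln_sigma + eta * kappa
        sigma = math.exp(ln_sigma)             # conditional standard deviation
        eps = rng.gauss(0.0, 1.0)              # innovation of the observed series
        sigmas.append(sigma)
        us.append(sigma * eps)
    return sigmas, us


sigmas, us = simulate_sv(n=200, phi=0.9, eta=0.2, seed=1)
```

Note that, in contrast to a GARCH model, the volatility path `sigmas` is driven by its own shock and cannot be recovered exactly from the observed `us`.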

16.8 Exercises

Problem  16.1 

Write  down  BEKK  GARCH  models  explicitly  for  the  following  combinations 
of  N  and  q  in  (16.3.7): 

(N,q)  =  (  1,1),  (2,1),  (1,2),  (2,  2). 

Problem  16.2 

Write down the factor MGARCH model (16.3.9) explicitly for N = 2.

Problem 16.3

Write down in detail all elements of $\Sigma_{t|t-1}$ of a factor MGARCH model as
proposed by Vrontos et al. (2003) (see Section 16.3.3) for the case of a bivariate
series (K = 2).

Problem 16.4

Consider the DEM/USD and GBP/USD exchange rate series from
www.jmulti.de → datasets

(File exrate.dat) and perform the following analysis steps:

(a)  Eliminate  all  rows  with  missing  values  from  the  exchange  rate  data  set. 

(b)  Determine  the  VAR  order  by  model  selection  criteria. 

(c)  Plot  the  autocorrelations  series  and  the  mean-adjusted  squared  series. 
Interpret  the  plots. 

(d) Use ARCH-LM and ARCH-portmanteau tests for the mean-adjusted series
and interpret the results. Apply the tests to the bivariate and the two
univariate series separately and compare the results.

(e)  Fit  a  bivariate  BEKK  GARCH(1, 1)  model  to  the  bivariate  series. 

(f)  Perform  model  specification  tests  based  on  the  residuals  of  the  estimated 
MGARCH  model  and  interpret  the  results. 

(Hint: A similar data set was analyzed by Herwartz (2004).)

Periodic VAR Processes and Intervention
Models 


17.1  Introduction 

In  Part  II  of  the  book,  we  have  considered  cointegrated  VAR  models  and 
we  have  seen  that  they  give  rise  to  nonstationary  processes  with  potentially 
time  varying  first  and  second  moments.  Yet  the  models  have  time  invariant 
coefficients.  Nonstationarity,  that  is,  time  varying  first  and/or  second  moments 
of  a  process,  can  also  be  modelled  in  the  framework  of  time  varying  coefficient 
processes.  Suppose,  for  instance,  that  the  time  series  under  consideration  show 
a  seasonal  pattern.  In  that  case,  a  VAR(p)  process  with  different  intercept 
terms  for  each  season  may  be  a  reasonable  model: 

\[ y_t = \nu_i + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t. \tag{17.1.1} \]

Here $\nu_i$ is a $(K \times 1)$ intercept vector associated with the $i$-th season, that is,
in (17.1.1), the time index $t$ is assumed to be associated with the $i$-th season
of the year. It is easy to see that such a process has a potentially different
mean for each season of the year.

Assuming $s$ seasons, the model (17.1.1) could be written alternatively as

\[ y_t = n_{1t}\nu_1 + \cdots + n_{st}\nu_s + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \]

where

\[ n_{it} = 0 \text{ or } 1 \quad \text{and} \quad \sum_{i=1}^{s} n_{it} = 1. \tag{17.1.2} \]

In other words, $n_{it}$ assumes the value of 1 if $t$ belongs to the $i$-th season and
is zero otherwise, that is, $n_{it}$ is a seasonal dummy variable.
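
As a small illustration of the dummy variable notation in (17.1.2), the following Python sketch builds the vector $(n_{1t}, \dots, n_{st})$ and the implied intercept for a given period. It assumes that period t = 1 falls into the first season and, for simplicity, uses scalar intercepts (K = 1); the function names are hypothetical.

```python
def seasonal_dummies(t, s):
    """Return [n_1t, ..., n_st], assuming that period t = 1 belongs to season 1."""
    season = (t - 1) % s                      # zero-based season index of period t
    return [1 if i == season else 0 for i in range(s)]


def intercept_at(t, nus):
    """Intercept n_1t * nu_1 + ... + n_st * nu_s for scalar seasonal intercepts."""
    dummies = seasonal_dummies(t, len(nus))
    return sum(n * nu for n, nu in zip(dummies, nus))


# Quarterly example: period 6 is the second quarter of the second year.
dummies_6 = seasonal_dummies(6, 4)
nu_6 = intercept_at(6, [10.0, 20.0, 30.0, 40.0])
```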

Of  course,  the  model  (17.1.1)  is  covered,  e.g.,  by  the  set-up  of  Chapter 
10.  In  a  seasonal  context,  it  is  possible,  however,  that  the  other  coefficients 
also  vary  for  different  seasons.  In  that  case,  a  more  general  model  may  be 
adequate:

\[ y_t = \nu_t + A_{1t} y_{t-1} + \cdots + A_{pt} y_{t-p} + u_t \tag{17.1.3} \]

with

\[ B_t := [\nu_t, A_{1t}, \dots, A_{pt}] = n_{1t}[\nu_1, A_{11}, \dots, A_{p1}] + \cdots + n_{st}[\nu_s, A_{1s}, \dots, A_{ps}] = n_{1t}B_1 + \cdots + n_{st}B_s \tag{17.1.4} \]

and

\[ \Sigma_t := E(u_t u_t') = n_{1t}\Sigma_1 + \cdots + n_{st}\Sigma_s. \tag{17.1.5} \]

Here the $n_{it}$ are seasonal dummy variables as in (17.1.2), the $B_i := [\nu_i, A_{1i}, \dots, A_{pi}]$
are $(K \times (Kp + 1))$ coefficient matrices, and the $\Sigma_i$ are $(K \times K)$
covariance matrices. The model (17.1.3) with periodically varying coefficients
as specified in (17.1.4)/(17.1.5) is a general periodic VAR(p) model, sometimes
abbreviated as PAR(p), with period $s$. Varying coefficient models of this type
will be discussed in Section 17.3.

The model (17.1.3) can also be used in a situation where a stationary, stable
data generation process is in operation until period $T_1$, say, and then some
outside intervention occurs after which another VAR(p) process generates the
data. This case can be handled within the model class (17.1.3) by defining
$s = 2$,

\[ n_{1t} = \begin{cases} 1 & \text{for } t \le T_1, \\ 0 & \text{for } t > T_1, \end{cases} \qquad n_{2t} = \begin{cases} 1 & \text{for } t > T_1, \\ 0 & \text{for } t \le T_1. \end{cases} \]

Intervention models of this type will be considered in Section 17.4. Interventions
in economic systems may, for instance, be due to legislative activities
or catastrophic weather conditions. Of course, there could be more than one
intervention in the stretch of a time series. The general model (17.1.3) encompasses
that situation when the dummy variables are chosen appropriately. In
Section  17.2,  some  properties  of  the  general  model  (17.1.3)  will  be  given  that 
can  be  derived  without  special  assumptions  regarding  the  movement  of  the 
parameters.  These  properties  are  valid  for  both  the  periodic  and  intervention 
models  discussed  in  Sections  17.3  and  17.4,  respectively. 

An  important  characteristic  of  periodic  and  intervention  models  is  that 
only  a  finite  number  of  regimes  exist  that  are  associated  with  specific,  known 
time  periods.  In  other  words,  the  coefficient  variations  are  systematic.  Such  a 
model  structure  is  not  realistic  in  all  situations  of  practical  interest.  We  will 
therefore discuss models with randomly varying coefficients in Chapter 18.

17.2 The VAR(p) Model with Time Varying Coefficients


In this section, we consider the following general form of a $K$-dimensional
VAR(p) model with time varying coefficients:

\[ y_t = \nu_t + A_{1t} y_{t-1} + \cdots + A_{pt} y_{t-p} + u_t, \quad t \in \mathbb{Z}, \tag{17.2.1} \]

where $u_t$ is a zero mean noise process with covariance matrices $E(u_t u_t') = \Sigma_t$.
That is, the $u_t$ may have time varying covariance matrices and, thus, may
not be identically distributed. We retain the independence assumption for $u_t$
and $u_s$, $s \ne t$. Of course, the constant coefficient VAR(p) model considered in
previous chapters is a special case of (17.2.1). Further special cases are treated
in the next sections. We will now discuss some properties of the general model.


17.2.1  General  Properties 


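To derive general properties, it is convenient to write the model (17.2.1) in
VAR(1) form:

\[ Y_t = \boldsymbol{\nu}_t + \mathbf{A}_t Y_{t-1} + U_t, \tag{17.2.2} \]

where

\[ Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \quad \boldsymbol{\nu}_t := \begin{bmatrix} \nu_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (Kp \times 1), \]

\[ \mathbf{A}_t := \begin{bmatrix} A_{1t} & \cdots & A_{p-1,t} & A_{pt} \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & & I_K & 0 \end{bmatrix} \quad (Kp \times Kp). \]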

By successive substitution we get

\[ Y_t = \Bigl(\prod_{j=0}^{h-1} \mathbf{A}_{t-j}\Bigr) Y_{t-h} + \sum_{i=0}^{h-1} \Bigl(\prod_{j=0}^{i-1} \mathbf{A}_{t-j}\Bigr) \boldsymbol{\nu}_{t-i} + \sum_{i=0}^{h-1} \Bigl(\prod_{j=0}^{i-1} \mathbf{A}_{t-j}\Bigr) U_{t-i}. \tag{17.2.3} \]


Defining the $(K \times Kp)$ matrix $J := [I_K : 0]$ such that $y_t = J Y_t$ and premultiplying
(17.2.3) by this matrix gives

\[ y_t = J \Bigl(\prod_{j=0}^{h-1} \mathbf{A}_{t-j}\Bigr) Y_{t-h} + \sum_{i=0}^{h-1} \Phi_{it}\, \nu_{t-i} + \sum_{i=0}^{h-1} \Phi_{it}\, u_{t-i}, \tag{17.2.4} \]

where

\[ \Phi_{it} := J \Bigl(\prod_{j=0}^{i-1} \mathbf{A}_{t-j}\Bigr) J' \tag{17.2.5} \]

and it has been used that $J'JU_t = U_t$, $JU_t = u_t$, and similar results hold for
$\boldsymbol{\nu}_t$. If

\[ \sum_{i=0}^{h-1} \Phi_{it}\, \nu_{t-i} \]

converges to a constant, say $\mu_t$, for $h \to \infty$, and if the first term on the
right-hand side of (17.2.4) converges to zero in mean square and the last term
converges in mean square as $h \to \infty$, we get the representation

\[ y_t = \mu_t + \sum_{i=0}^{\infty} \Phi_{it}\, u_{t-i}, \tag{17.2.6} \]

where $\mu_t = E(y_t)$. In the following, it is assumed without further notice that
this representation exists.

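It can be used to derive the autocovariance structure of the process. For
instance,

\begin{align*}
E[(y_t - \mu_t)(y_t - \mu_t)'] &= E\Bigl[\Bigl(\sum_{j=0}^{\infty} \Phi_{jt} u_{t-j}\Bigr)\Bigl(\sum_{i=0}^{\infty} \Phi_{it} u_{t-i}\Bigr)'\Bigr] \\
&= \sum_{j=0}^{\infty}\sum_{i=0}^{\infty} \Phi_{jt}\, E(u_{t-j} u_{t-i}')\, \Phi_{it}' = \sum_{i=0}^{\infty} \Phi_{it} \Sigma_{t-i} \Phi_{it}', \\
E[(y_t - \mu_t)(y_{t-1} - \mu_{t-1})'] &= E\Bigl[\Bigl(\sum_{j=0}^{\infty} \Phi_{jt} u_{t-j}\Bigr)\Bigl(\sum_{i=0}^{\infty} \Phi_{i,t-1} u_{t-1-i}\Bigr)'\Bigr] \\
&= \sum_{j=0}^{\infty}\sum_{i=0}^{\infty} \Phi_{jt}\, E(u_{t-j} u_{t-1-i}')\, \Phi_{i,t-1}' = \sum_{i=0}^{\infty} \Phi_{i+1,t} \Sigma_{t-1-i} \Phi_{i,t-1}'.
\end{align*}

More generally, for some integer $h$,

\[ E[(y_t - \mu_t)(y_{t-h} - \mu_{t-h})'] = \sum_{i=0}^{\infty} \Phi_{i+h,t}\, \Sigma_{t-h-i}\, \Phi_{i,t-h}'. \]

Usually these formulas are not very useful for actually computing the autocovariances.
They show, however, that the autocovariances generally depend
on $t$ and $h$ so that the process $y_t$ is not stationary. In addition, of course, the
mean vectors $\mu_t$ may be time varying.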

Optimal forecasts can be obtained either from (17.2.1) or from (17.2.6).
In the former case, the forecasts can be computed recursively as

\[ y_t(h) = \nu_{t+h} + A_{1,t+h}\, y_t(h-1) + \cdots + A_{p,t+h}\, y_t(h-p), \tag{17.2.7} \]

where $y_t(j) := y_{t+j}$ for $j \le 0$. Using (17.2.6) gives

\[ y_t(h) = \mu_{t+h} + \sum_{i=h}^{\infty} \Phi_{i,t+h}\, u_{t+h-i} \tag{17.2.8} \]

and the forecast error is

\[ y_{t+h} - y_t(h) = \sum_{i=0}^{h-1} \Phi_{i,t+h}\, u_{t+h-i}. \tag{17.2.9} \]

Hence, the forecast MSE matrices turn out to be

\[ \Sigma_t(h) := \mathrm{MSE}[y_t(h)] = \sum_{i=0}^{h-1} \Phi_{i,t+h}\, \Sigma_{t+h-i}\, \Phi_{i,t+h}'. \tag{17.2.10} \]
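
The forecast recursion (17.2.7) and the MSE formula (17.2.10) can be checked numerically in the simplest case. The Python sketch below assumes a scalar process (K = 1) of order p = 1, so that $\Phi_{i,t} = a_t a_{t-1} \cdots a_{t-i+1}$; the callables `nu`, `a`, and `sigma2`, which return the intercept, the AR coefficient, and the innovation variance for a given period, are hypothetical and take made-up values.

```python
def phi(i, t, a):
    """Phi_{i,t} = a(t) * a(t-1) * ... * a(t-i+1), with Phi_{0,t} = 1."""
    prod = 1.0
    for j in range(i):
        prod *= a(t - j)
    return prod


def forecast(y_t, t, h, nu, a):
    """h-step forecast from origin t via the recursion (17.2.7) with p = 1."""
    y = y_t
    for step in range(1, h + 1):
        y = nu(t + step) + a(t + step) * y
    return y


def forecast_mse(t, h, a, sigma2):
    """Forecast MSE (17.2.10) in the scalar case."""
    return sum(phi(i, t + h, a) ** 2 * sigma2(t + h - i) for i in range(h))


# Made-up coefficients that alternate between even and odd periods.
nu = lambda t: 1.0
a = lambda t: 0.5 if t % 2 == 0 else 0.8
sigma2 = lambda t: 1.0 if t % 2 == 0 else 2.0

y_hat = forecast(y_t=2.0, t=10, h=2, nu=nu, a=a)
mse = forecast_mse(t=10, h=2, a=a, sigma2=sigma2)
```

For this example the two-step forecast error is $u_{12} + a_{12} u_{11}$, so the MSE depends on the forecast origin through both the coefficients and the variances.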

We  will  discuss  some  basics  of  ML  estimation  for  the  general  model  (17.2.1) 
next. 

17.2.2  ML  Estimation 

Although  specific  results  require  specific  assumptions,  it  is  useful  to  establish 
some  general  results  related  to  ML  estimation  for  Gaussian  processes  first. 
We write the model (17.2.1) as

\[ y_t = B_t Z_{t-1} + u_t, \tag{17.2.11} \]

where $B_t := [\nu_t, A_{1t}, \dots, A_{pt}]$, $Z_{t-1} := (1, Y_{t-1}')'$, and we assume that the
$(K \times (Kp + 1))$ matrices $B_t$ depend on an $(N \times 1)$ vector $\gamma$ of fixed, time
invariant parameters, that is, $B_t = B_t(\gamma)$. Furthermore, the $\Sigma_t$ are assumed
to depend on an $(M \times 1)$ vector $\sigma$ of fixed parameters. The vector $\sigma$ is disjoint
from and unrelated to $\gamma$. Examples where this situation arises will be seen in
the next sections. One example, of course, is a constant coefficient model,
where $B_t = B = [\nu, A_1, \dots, A_p]$ and $\Sigma_t = \Sigma_u$ for all $t$. Here we may choose
$\gamma = \mathrm{vec}(B)$ and $\sigma = \mathrm{vech}(\Sigma_u)$ if no further restrictions are imposed.

Assuming that $u_t$ is a Gaussian noise process, that is, $u_t \sim \mathcal{N}(0, \Sigma_t)$, the
log-likelihood function of our general model is

\[ \ln l(\gamma, \sigma) = -\frac{KT}{2} \ln 2\pi - \frac{1}{2} \sum_{t=1}^{T} \ln|\Sigma_t| - \frac{1}{2} \sum_{t=1}^{T} u_t' \Sigma_t^{-1} u_t, \tag{17.2.12} \]

where any initial condition terms are ignored. The corresponding normal equations
are


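\begin{align*}
0 = \frac{\partial \ln l}{\partial \gamma}
&= -\sum_t \frac{\partial u_t'}{\partial \gamma}\, \Sigma_t^{-1} u_t \\
&= \sum_t \frac{\partial \mathrm{vec}(B_t)'}{\partial \gamma}\, (Z_{t-1} \otimes I_K)\, \Sigma_t^{-1} u_t \\
&= \sum_t \frac{\partial \mathrm{vec}(B_t)'}{\partial \gamma}\, \mathrm{vec}(\Sigma_t^{-1} u_t Z_{t-1}') \tag{17.2.13}
\end{align*}

and

\begin{align*}
0 = \frac{\partial \ln l}{\partial \sigma}
&= -\frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}\, \frac{\partial \ln|\Sigma_t|}{\partial \mathrm{vec}(\Sigma_t)}
 - \frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}\, \frac{\partial\, u_t' \Sigma_t^{-1} u_t}{\partial \mathrm{vec}(\Sigma_t)} \\
&= -\frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}\, \mathrm{vec}(\Sigma_t^{-1} - \Sigma_t^{-1} u_t u_t' \Sigma_t^{-1}). \tag{17.2.14}
\end{align*}

Even if $\partial \mathrm{vec}(B_t)'/\partial \gamma$ is a matrix that does not depend on $\gamma$, (17.2.13) is in
general a system of equations which is nonlinear in $\gamma$ and $\sigma$ because $u_t =
y_t - B_t Z_{t-1}$ involves $\gamma$. However, we will see in the next sections that in many
cases of interest, (17.2.13) reduces to a linear system which is easy to solve.
Also, a solution of (17.2.14) is easy to obtain under the conditions of the next
sections.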

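It is furthermore possible to derive an expression for the information matrix
associated with the general log-likelihood function (17.2.12). The second
partial derivatives with respect to $\gamma$ are

\begin{align*}
\frac{\partial^2 \ln l}{\partial \gamma\, \partial \gamma'}
&= -\sum_t \frac{\partial \mathrm{vec}(B_t)'}{\partial \gamma}\, (Z_{t-1} \otimes I_K)\, \Sigma_t^{-1}\, (Z_{t-1}' \otimes I_K)\, \frac{\partial \mathrm{vec}(B_t)}{\partial \gamma'} + \text{terms with mean zero} \\
&= -\sum_t \frac{\partial \mathrm{vec}(B_t)'}{\partial \gamma}\, (Z_{t-1} Z_{t-1}' \otimes \Sigma_t^{-1})\, \frac{\partial \mathrm{vec}(B_t)}{\partial \gamma'} + \text{terms with mean zero}. \tag{17.2.15}
\end{align*}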


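Assuming that $\partial \mathrm{vec}(\Sigma_t)'/\partial \sigma$ does not depend on $\sigma$ and, thus, the second
order partial derivatives of $\Sigma_t$ with respect to the elements of $\sigma$ are zero, we
get

\begin{align*}
\frac{\partial^2 \ln l}{\partial \sigma\, \partial \sigma'}
&= -\frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}
\Bigl[ \frac{\partial \mathrm{vec}(\Sigma_t^{-1})}{\partial \sigma'}
 - (I_K \otimes \Sigma_t^{-1} u_t u_t')\, \frac{\partial \mathrm{vec}(\Sigma_t^{-1})}{\partial \sigma'}
 - (\Sigma_t^{-1} u_t u_t' \otimes I_K)\, \frac{\partial \mathrm{vec}(\Sigma_t^{-1})}{\partial \sigma'} \Bigr] \\
&= -\frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}
\bigl[ -(\Sigma_t^{-1} \otimes \Sigma_t^{-1}) + (\Sigma_t^{-1} \otimes \Sigma_t^{-1} u_t u_t' \Sigma_t^{-1}) + (\Sigma_t^{-1} u_t u_t' \Sigma_t^{-1} \otimes \Sigma_t^{-1}) \bigr]
\frac{\partial \mathrm{vec}(\Sigma_t)}{\partial \sigma'}. \tag{17.2.16}
\end{align*}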


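The assumption of zero second partial derivatives of $\Sigma_t$ with respect to $\sigma$ will
be satisfied in all cases of interest in the following sections. Furthermore, it is
easy to see that under the present assumptions

\[ E[\partial^2 \ln l / \partial \gamma\, \partial \sigma'] = 0. \]

Consequently, the information matrix becomes

\begin{align*}
\mathcal{I}(\gamma, \sigma)
&= E\left[ \frac{\partial^2 (-\ln l)}{\partial \begin{bmatrix} \gamma \\ \sigma \end{bmatrix} \partial(\gamma', \sigma')} \right]
 = -E\left[ \frac{\partial^2 \ln l}{\partial \begin{bmatrix} \gamma \\ \sigma \end{bmatrix} \partial(\gamma', \sigma')} \right] \\
&= \begin{bmatrix}
\displaystyle \sum_t \frac{\partial \mathrm{vec}(B_t)'}{\partial \gamma}\, [E(Z_{t-1} Z_{t-1}') \otimes \Sigma_t^{-1}]\, \frac{\partial \mathrm{vec}(B_t)}{\partial \gamma'} & 0 \\
0 & \displaystyle \frac{1}{2} \sum_t \frac{\partial \mathrm{vec}(\Sigma_t)'}{\partial \sigma}\, (\Sigma_t^{-1} \otimes \Sigma_t^{-1})\, \frac{\partial \mathrm{vec}(\Sigma_t)}{\partial \sigma'}
\end{bmatrix}. \tag{17.2.17}
\end{align*}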


Although  these  expressions  look  a  bit  unwieldy  in  their  present  general  form, 
they  are  quite  handy  if  special  assumptions  regarding  the  time  variations  of 
the  coefficients  are  made.  We  will  now  turn  to  such  special  types  of  time 
varying  coefficient  VAR  models. 


17.3  Periodic  Processes 

As we have seen in Section 17.1, in periodic VAR or PAR processes the coefficients
vary periodically with period $s$, say. In other words,

\[ y_t = \nu_t + A_t Y_{t-1} + u_t, \tag{17.3.1} \]

where

\begin{align*}
\nu_t &= n_{1t}\nu_1 + \cdots + n_{st}\nu_s, & (K \times 1) \\
A_t &= [A_{1t}, \dots, A_{pt}] = n_{1t}A_1 + \cdots + n_{st}A_s, & (K \times Kp) \tag{17.3.2} \\
\Sigma_t &= E(u_t u_t') = n_{1t}\Sigma_1 + \cdots + n_{st}\Sigma_s, & (K \times K)
\end{align*}

and the $n_{it}$ are seasonal dummy variables which have a value of one if $t$
is associated with the $i$-th season and zero otherwise. Obviously, the general
framework of the previous section encompasses this model. Hence, some
properties  can  be  obtained  by  substituting  the  expressions  from  (17.3.2)  in 
the  general  formulas  of  the  previous  section.  For  periodic  processes,  however, 
many  properties  are  more  easily  derived  via  another  approach  which  will  be 
introduced  and  exploited  in  the  next  subsection. 

Special models arise if only a subset of the parameters vary periodically.
For instance, if $\Sigma_i = \Sigma_1$ and $A_i = A_1$ for $i = 1, \dots, s$, we have a model with
seasonal means and otherwise time invariant structure. Simplifications of this
kind  are  useful  in  practice  because  they  imply  a  reduction  in  the  number  of 
free  parameters  to  be  estimated  and  thereby  result  in  more  efficient  estimates 
and  forecasts,  at  least  in  large  samples.  A  special  case  of  foremost  interest  is,  of 
course,  a  non-periodic,  constant  coefficient  VAR  model.  If  the  data  generation 
process  turns  out  to  be  of  that  type,  the  interpretation  and  analysis  is  greatly 
simplified.  We  will  consider  estimation  and  tests  of  various  sets  of  relevant 
hypotheses  in  Subsection  17.3.2. 


17.3.1  A  VAR  Representation  with  Time  Invariant  Coefficients 

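Suppose we have a quarterly process with period $s = 4$ and $y_1$ belongs to the
first quarter. Then we may define an annual process with vectors

\[ \mathfrak{y}_1 := \begin{bmatrix} y_4 \\ y_3 \\ y_2 \\ y_1 \end{bmatrix}, \quad \mathfrak{y}_2 := \begin{bmatrix} y_8 \\ y_7 \\ y_6 \\ y_5 \end{bmatrix}, \quad \dots, \quad \mathfrak{y}_\tau := \begin{bmatrix} y_{4\tau} \\ y_{4\tau-1} \\ y_{4\tau-2} \\ y_{4\tau-3} \end{bmatrix}, \quad \dots \]

This process has a representation with time invariant coefficient matrices. For
instance, if the process for each quarter is a VAR(1),

\begin{align*}
y_t &= \nu_t + A_{1t} y_{t-1} + u_t \\
&= \nu_i + A_{1,i}\, y_{t-1} + u_t, \quad \text{if } t \text{ belongs to the } i\text{-th quarter},
\end{align*}

then the process $\mathfrak{y}_\tau$ has the representation

\[ \begin{bmatrix} I_K & -A_{1,4} & 0 & 0 \\ 0 & I_K & -A_{1,3} & 0 \\ 0 & 0 & I_K & -A_{1,2} \\ 0 & 0 & 0 & I_K \end{bmatrix}
\begin{bmatrix} y_{4\tau} \\ y_{4\tau-1} \\ y_{4\tau-2} \\ y_{4\tau-3} \end{bmatrix}
= \begin{bmatrix} \nu_4 \\ \nu_3 \\ \nu_2 \\ \nu_1 \end{bmatrix}
+ \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ A_{1,1} & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} y_{4\tau-4} \\ y_{4\tau-5} \\ y_{4\tau-6} \\ y_{4\tau-7} \end{bmatrix}
+ \begin{bmatrix} u_{4\tau} \\ u_{4\tau-1} \\ u_{4\tau-2} \\ u_{4\tau-3} \end{bmatrix}. \tag{17.3.3} \]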
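More generally, if we have $s$ different regimes (seasons per year) with constant
parameters within each regime and if we assume that $y_1$ belongs to the
first season, we may define the $sK$-dimensional process

\[ \mathfrak{y}_\tau := \begin{bmatrix} y_{s\tau} \\ y_{s\tau-1} \\ \vdots \\ y_{s\tau-s+1} \end{bmatrix} \quad (sK \times 1), \qquad \tau = 0, \pm 1, \pm 2, \dots \]

This process has the following VAR(P) representation, where $P$ is the smallest
integer greater than or equal to $p/s$:

\[ \mathfrak{A}_0\, \mathfrak{y}_\tau = \mathfrak{v} + \mathfrak{A}_1\, \mathfrak{y}_{\tau-1} + \cdots + \mathfrak{A}_P\, \mathfrak{y}_{\tau-P} + \mathfrak{u}_\tau, \tag{17.3.4} \]

where

\[ \mathfrak{A}_0 := \begin{bmatrix} I_K & -A_{1,s} & -A_{2,s} & \cdots & -A_{s-1,s} \\ 0 & I_K & -A_{1,s-1} & \cdots & -A_{s-2,s-1} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & I_K \end{bmatrix} \quad (sK \times sK), \]

\[ \mathfrak{A}_i := \begin{bmatrix} A_{is,s} & A_{is+1,s} & \cdots & A_{is+s-1,s} \\ A_{is-1,s-1} & A_{is,s-1} & \cdots & A_{is+s-2,s-1} \\ \vdots & \vdots & & \vdots \\ A_{is-s+1,1} & A_{is-s+2,1} & \cdots & A_{is,1} \end{bmatrix} \quad (sK \times sK), \quad i = 1, \dots, P, \]

\[ \mathfrak{v} := \begin{bmatrix} \nu_s \\ \nu_{s-1} \\ \vdots \\ \nu_1 \end{bmatrix} \quad (sK \times 1), \qquad \mathfrak{u}_\tau := \begin{bmatrix} u_{s\tau} \\ u_{s\tau-1} \\ \vdots \\ u_{s\tau-s+1} \end{bmatrix} \quad (sK \times 1). \]

All $A_{i,j}$'s with $i > p$ are zero.

The process $\mathfrak{y}_\tau$ is stationary if the $y_t$'s have bounded first and second
moments and the VAR operator is stable, that is,

\begin{align*}
\det(\mathfrak{A}_0 - \mathfrak{A}_1 z - \cdots - \mathfrak{A}_P z^P)
&= \det(I_{sK} - \mathfrak{A}_0^{-1}\mathfrak{A}_1 z - \cdots - \mathfrak{A}_0^{-1}\mathfrak{A}_P z^P) \ne 0 \quad \text{for } |z| \le 1. \tag{17.3.5}
\end{align*}

Note that $\det(\mathfrak{A}_0) = 1$.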

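For the example process (17.3.3), we have

\[ \mathfrak{A}_0^{-1} = \begin{bmatrix} I_K & A_{1,4} & A_{1,4}A_{1,3} & A_{1,4}A_{1,3}A_{1,2} \\ 0 & I_K & A_{1,3} & A_{1,3}A_{1,2} \\ 0 & 0 & I_K & A_{1,2} \\ 0 & 0 & 0 & I_K \end{bmatrix} \]

and, thus,

\[ \mathfrak{A}_0^{-1}\mathfrak{A}_1 = \begin{bmatrix} A_{1,4}A_{1,3}A_{1,2}A_{1,1} & 0 & 0 & 0 \\ A_{1,3}A_{1,2}A_{1,1} & 0 & 0 & 0 \\ A_{1,2}A_{1,1} & 0 & 0 & 0 \\ A_{1,1} & 0 & 0 & 0 \end{bmatrix}. \]

Hence,

\[ \det(\mathfrak{A}_0 - \mathfrak{A}_1 z) = \det(I_K - A_{1,4}A_{1,3}A_{1,2}A_{1,1}\, z) \ne 0 \quad \text{for } |z| \le 1 \]

is the stability condition for the example process. If this condition is satisfied,
we can, for instance, compute the autocovariances of the process $\mathfrak{y}_\tau$ in the
usual way. Note, however, that stationarity of $\mathfrak{y}_\tau$ does not imply stationarity
of the original process $y_t$. Even if $\mathfrak{y}_\tau$ has a time invariant mean vector

\[ \begin{bmatrix} \mu_4 \\ \mu_3 \\ \mu_2 \\ \mu_1 \end{bmatrix}, \]

for example, the mean vectors $\mu_4$ and $\mu_3$ associated with the fourth and
third quarters, respectively, may be different. Similar thoughts apply for other
quarters and for the autocovariances associated with different quarters.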
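
A short numerical sketch may help to see both points at once: stability depends only on the product of the seasonal coefficients, and a stable periodic process can still have different means in different seasons. The Python code below assumes a scalar PAR(1) process (K = 1, p = 1), for which the seasonal means satisfy $\mu_i = \nu_i + a_i \mu_{i-1}$ with $\mu_0 := \mu_s$; the function names and the numbers are hypothetical.

```python
def is_stable(a):
    """Scalar analogue of the stability condition: |a_s * ... * a_1| < 1."""
    prod = 1.0
    for coef in a:
        prod *= coef
    return abs(prod) < 1.0


def seasonal_means(nu, a):
    """Seasonal means [mu_1, ..., mu_s] of a stable scalar PAR(1) process."""
    s = len(a)
    # Going once through the year expresses mu_s as const + prod * mu_s.
    const, prod = 0.0, 1.0
    for i in range(s):
        const = nu[i] + a[i] * const
        prod *= a[i]
    mu_prev = const / (1.0 - prod)            # this is mu_s
    means = []
    for i in range(s):
        mu_prev = nu[i] + a[i] * mu_prev      # mu_i = nu_i + a_i * mu_{i-1}
        means.append(mu_prev)
    return means


# Made-up quarterly example: season 1 is listed first.
nu = [1.0, 2.0, 0.0, 0.0]
a = [0.5, 0.5, 1.0, 1.0]
means = seasonal_means(nu, a)
```

In this example `is_stable([1.5, 0.5, 1.0, 1.0])` is also true, that is, a single seasonal coefficient may exceed one in absolute value as long as the product over the year stays inside the unit circle.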

The process $\mathfrak{y}_\tau$ corresponding to a periodic process $y_t$ can also be used to
determine an upper bound for the order $p$ of the latter. If $\mathfrak{y}_\tau$ is stationary and
its order $P$ is selected in the usual way, we know that $p \le sP$.

Optimal forecasts of a periodic process are easily obtained from the recursions
(17.2.7). Assuming that the forecast origin $t$ is associated with the last
period of the year, we get

\begin{align*}
y_t(1) &= \nu_1 + A_{1,1}\, y_t + \cdots + A_{p,1}\, y_{t-p+1}, \\
y_t(2) &= \nu_2 + A_{1,2}\, y_t(1) + \cdots + A_{p,2}\, y_{t-p+2}, \\
&\;\;\vdots \\
y_t(s) &= \nu_s + A_{1,s}\, y_t(s-1) + \cdots + A_{p,s}\, y_t(s-p), \\
y_t(s+1) &= \nu_1 + A_{1,1}\, y_t(s) + \cdots + A_{p,1}\, y_t(s+1-p), \\
&\;\;\vdots
\end{align*}
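
The forecasting scheme above cycles through the seasonal coefficient sets. The following Python sketch shows one possible way to code it for a scalar periodic AR(p) process, assuming that $y_1$ belongs to the first season; the function name, the data layout (`nu[i]` and `A[i]` hold the intercept and the p lag coefficients of season i + 1), and the numbers are our own assumptions.

```python
def periodic_forecasts(history, nu, A, horizon):
    """Forecasts y_t(1), ..., y_t(horizon) for a scalar periodic AR(p) process.

    history holds y_1, ..., y_t (at least p values); y_1 is assumed to
    belong to season 1, so period n belongs to season ((n - 1) % s) + 1.
    """
    s = len(nu)
    p = len(A[0])
    path = list(history)
    origin = len(history)
    for h in range(1, horizon + 1):
        season = (origin + h - 1) % s                 # zero-based season of period t + h
        lags = [path[-j] for j in range(1, p + 1)]    # y_t(h-1), ..., y_t(h-p)
        path.append(nu[season] + sum(c * y for c, y in zip(A[season], lags)))
    return path[origin:]


# Made-up example with s = 2 seasons and p = 1; the origin t = 2 is the last season.
forecasts = periodic_forecasts(history=[0.0, 4.0],
                               nu=[1.0, 0.0],
                               A=[[0.5], [2.0]],
                               horizon=3)
```

Past observations and earlier forecasts are stored in the same list, which mirrors the convention $y_t(j) := y_{t+j}$ for $j \le 0$ in (17.2.7).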

17.3.2 ML Estimation and Testing for Time Varying Coefficients

The  general  framework  for  ML  estimation  of  the  periodic  VAR(p)  model  given 
in  (17.3.1)/(17.3.2),  under  Gaussian  assumptions,  is  laid  out  in  Section  17.2.2. 
For  the  present  case,  however,  a  number  of  simplifications  are  obtained  and 
closed  form  expressions  can  be  given  for  the  estimators.  In  the  following,  we 
discuss  estimation  under  various  types  of  restrictions  and  we  consider  tests 
of  time  invariance  of  different  groups  of  coefficients.  Most  of  the  tests  are 
likelihood  ratio  (LR)  tests,  the  general  form  of  which  is  discussed  in  Appendix 
C.7. Recall that the LR statistic is

\[ \lambda_{LR} = 2[\ln l(\tilde\delta) - \ln l(\tilde\delta_r)], \tag{17.3.6} \]

where $\tilde\delta$ is the unconstrained ML estimator and $\tilde\delta_r$ is the restricted ML estimator
obtained by maximizing the likelihood function under the null hypothesis
$H_0$. If $H_0$ is true, under general conditions, the LR statistic has an asymptotic
$\chi^2$-distribution with degrees of freedom equal to the number of linearly
independent  restrictions.  In  the  following,  we  will  give  the  maximum  of  the 
likelihood  function  under  various  sets  of  restrictions  in  addition  to  the  ML 
estimators.  These  results  will  enable  us  to  set  up  LR  tests  for  different  sets  of 
restrictions. 

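For the present case of a periodic VAR model, the normal equations given
in (17.2.13) reduce to

\begin{align*}
0 = \frac{\partial \ln l}{\partial \gamma}
&= \sum_{t=1}^{T} \frac{\partial \mathrm{vec}(n_{1t}B_1 + \cdots + n_{st}B_s)'}{\partial \gamma}\, \mathrm{vec}(\Sigma_t^{-1} u_t Z_{t-1}') \\
&= \sum_{i=1}^{s} \sum_{t=1}^{T} n_{it}\, \frac{\partial \mathrm{vec}(B_i)'}{\partial \gamma}\, \mathrm{vec}(\Sigma_i^{-1} u_t Z_{t-1}'), \tag{17.3.7}
\end{align*}

where $B_i := [\nu_i, A_{1i}, \dots, A_{pi}] := [\nu_i, A_i]$, $i = 1, \dots, s$, and

\[ \Sigma_t^{-1} = \sum_{i} n_{it} \Sigma_i^{-1} \]

has been used. Moreover, (17.2.14) reduces to

\[ 0 = \frac{\partial \ln l}{\partial \sigma} = -\frac{1}{2} \sum_{i=1}^{s} \sum_{t=1}^{T} n_{it}\, \frac{\partial \mathrm{vec}(\Sigma_i)'}{\partial \sigma}\, \mathrm{vec}(\Sigma_i^{-1} - \Sigma_i^{-1} u_t u_t' \Sigma_i^{-1}). \tag{17.3.8} \]

We will see in the following that the solution of these sets of normal equations
is relatively easy in many situations. The discussion follows Lütkepohl (1992).

All Coefficients Time Varying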

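We begin with a periodic VAR(p) model for which all coefficients are time
varying, that is,

\[ H_1: \; B_t = [\nu_t, A_t] = \sum_{i=1}^{s} n_{it} B_i, \quad \Sigma_t = \sum_{i=1}^{s} n_{it} \Sigma_i. \tag{17.3.9} \]

For this case

\[ \gamma = \mathrm{vec}[B_1, \dots, B_s] \]

and

\[ \sigma = [\mathrm{vech}(\Sigma_1)', \dots, \mathrm{vech}(\Sigma_s)']'. \]

Using a little algebra, the ML estimators can be obtained from (17.3.7) and
(17.3.8):

\[ \tilde B_i^{(1)} = \Bigl( \sum_t n_{it}\, y_t Z_{t-1}' \Bigr) \Bigl( \sum_t n_{it}\, Z_{t-1} Z_{t-1}' \Bigr)^{-1} \tag{17.3.10} \]

and

\[ \tilde\Sigma_i^{(1)} = \sum_t n_{it}\, (y_t - \tilde B_i^{(1)} Z_{t-1})(y_t - \tilde B_i^{(1)} Z_{t-1})' / (T \bar n_i) \tag{17.3.11} \]

for $i = 1, \dots, s$. Here $\bar n_i = \sum_{t=1}^{T} n_{it} / T$. Except for an additive constant, the
corresponding maximum of the log-likelihood function is

\[ \lambda_1 := -\tfrac{1}{2}\, T \bigl( \bar n_1 \ln|\tilde\Sigma_1^{(1)}| + \cdots + \bar n_s \ln|\tilde\Sigma_s^{(1)}| \bigr). \tag{17.3.12} \]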
All  Coefficients  Time  Invariant 

The next case we consider is our well-known basic stationary VAR(p) model,
where all the coefficients are time invariant:

\[ H_2: \; B_i = B_1, \quad \Sigma_i = \Sigma_1, \quad i = 2, \dots, s. \tag{17.3.13} \]

For this case, we know that the ML estimators are

\[ \tilde B_1^{(2)} = \Bigl( \sum_t y_t Z_{t-1}' \Bigr) \Bigl( \sum_t Z_{t-1} Z_{t-1}' \Bigr)^{-1} \tag{17.3.14} \]

and

\[ \tilde\Sigma_1^{(2)} = \sum_t (y_t - \tilde B_1^{(2)} Z_{t-1})(y_t - \tilde B_1^{(2)} Z_{t-1})' / T. \tag{17.3.15} \]

The maximum log-likelihood is, except for an additive constant,

\[ \lambda_2 := -\tfrac{1}{2}\, T \ln|\tilde\Sigma_1^{(2)}|. \tag{17.3.16} \]

This  case  is  considered  here  because  H2  is  a  null  hypothesis  of  foremost  interest 
in  the  present  context.  Of  course,  if  it  turns  out  that  H2  is  true,  we  can 
proceed  with  a  standard  VAR  analysis.  The  slight  change  of  notation  relative 
to  previous  chapters  is  useful  here  to  avoid  confusion. 

Time  Invariant  White  Noise 

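If just the white noise covariance matrix is time invariant while the other
coefficients vary, we have

\[ H_3: \; B_t = [\nu_t, A_t] = \sum_{i=1}^{s} n_{it} B_i \quad \text{and} \quad \Sigma_i = \Sigma_1, \quad i = 2, \dots, s. \tag{17.3.17} \]

For this case, it follows from (17.3.7) that the ML estimators of the $B_i$ are

\[ \tilde B_i^{(3)} = \tilde B_i^{(1)}, \quad i = 1, \dots, s, \tag{17.3.18} \]

and (17.3.8) implies

\[ \tilde\Sigma_1^{(3)} = \sum_{i=1}^{s} \sum_{t=1}^{T} n_{it}\, (y_t - \tilde B_i^{(3)} Z_{t-1})(y_t - \tilde B_i^{(3)} Z_{t-1})' / T. \tag{17.3.19} \]

The resulting maximum log-likelihood turns out to be

\[ \lambda_3 := -\tfrac{1}{2}\, T \ln|\tilde\Sigma_1^{(3)}|, \tag{17.3.20} \]

where again an additive constant is suppressed.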


Time  Invariant  Covariance  Structure 

If just the intercept terms and, hence, the means are time varying, we have the
conventional case of a model with seasonal dummies and otherwise constant
coefficients. In the present framework, this situation may be represented as

\[ H_4: \; \nu_t = \sum_{i=1}^{s} n_{it} \nu_i \quad \text{and} \quad A_i = A_1, \; \Sigma_i = \Sigma_1, \quad i = 2, \dots, s. \tag{17.3.21} \]

Under this hypothesis, the ML estimators are easily obtained by defining

\[ W_{t-1} := \begin{bmatrix} n_{1,t} \\ \vdots \\ n_{s,t} \\ Y_{t-1} \end{bmatrix} \quad \text{and} \quad C := [\nu_1, \dots, \nu_s, A_1]. \]

The ML estimator of $C$ is

\[ \tilde C = \Bigl( \sum_t y_t W_{t-1}' \Bigr) \Bigl( \sum_t W_{t-1} W_{t-1}' \Bigr)^{-1} \tag{17.3.22} \]

and that of $\Sigma_1$ is

\[ \tilde\Sigma_1^{(4)} = \sum_t (y_t - \tilde C W_{t-1})(y_t - \tilde C W_{t-1})' / T. \tag{17.3.23} \]

Dropping again an additive constant, the corresponding maximum of the
log-likelihood function is

\[ \lambda_4 := -\tfrac{1}{2}\, T \ln|\tilde\Sigma_1^{(4)}|. \tag{17.3.24} \]

LR  Tests 

In Table 17.1, the LR tests of some hypotheses of interest are listed. The LR
statistics, under general conditions, all have asymptotic $\chi^2$-distributions with
the given degrees of freedom. For this result to hold, it is important that the
$\bar n_i$ are approximately equal for $i = 1, \dots, s$, as assumed in periodic models.
Moreover, the corresponding $\mathfrak{y}_\tau$ process is assumed to be stable. In Chapter
8, Section 8.4.3, we have argued that if integrated variables are involved and
VECMs are considered, the degrees of freedom of Chow tests have to be
adjusted relative to the stable case. Because Chow tests are formally similar to
some of the tests considered here, it is perhaps not surprising that adjustments
will also be necessary in the present case if $I(1)$ variables are involved. The
reader is invited to check the degrees of freedom listed in Table 17.1 for the
case of a stable underlying $\mathfrak{y}_\tau$ process by counting the number of restrictions
imposed under the null hypothesis.

In Chapter 4, Section 4.6.1, we have argued that the asymptotic distributions
of similar LR tests are poor guides for the actual small sample distributions.
Therefore, the same problem must be expected to prevail in the present
case. Using bootstrap versions of the present tests may improve the situation
(see Appendix D.3).

Table 17.1. LR tests for time varying parameters

null hypothesis   alternative hypothesis   LR statistic $\lambda_{LR}$    degrees of freedom
$H_2$             $H_1$                    $2(\lambda_1 - \lambda_2)$     $(s-1)K[K(p+\tfrac{1}{2})+\tfrac{3}{2}]$
$H_3$             $H_1$                    $2(\lambda_1 - \lambda_3)$     $(s-1)K(K+1)/2$
$H_4$             $H_1$                    $2(\lambda_1 - \lambda_4)$     $(s-1)K[Kp+(K+1)/2]$
$H_2$             $H_3$                    $2(\lambda_3 - \lambda_2)$     $(s-1)K(Kp+1)$
$H_2$             $H_4$                    $2(\lambda_4 - \lambda_2)$     $(s-1)K$

Testing a Model with Time Varying Error Covariance Matrix Only Against One Where All Coefficients Are Time Varying

ML estimation of models for which all coefficients are time invariant except for the error covariance matrix is complicated by the implied nonlinearity of the normal equations (17.3.7). Thus, LR tests involving the hypothesis

$$H_5: B_i = B_1,\ i = 2, \dots, s \quad\text{and}\quad \Sigma_t = \sum_{i=1}^{s} n_{it}\Sigma_i \tag{17.3.25}$$

are computationally unattractive. If we wish to test $H_5$ against a model for which all parameters are time varying ($H_5$ against $H_1$), estimation under the alternative is straightforward and, therefore, a Wald test may be considered.

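Just as a reminder, if the unrestricted estimator $\tilde{\gamma}$ of a parameter vector $\gamma$ has an asymptotic normal distribution,

$$\sqrt{T}(\tilde{\gamma} - \gamma) \xrightarrow{d} \mathcal{N}(0, \Sigma_\gamma),$$

and the restrictions under the null hypothesis are given in the form $R\gamma = 0$, then the Wald statistic is of the form

$$\lambda_W = T\tilde{\gamma}' R'(R\tilde{\Sigma}_\gamma R')^{-1} R\tilde{\gamma}, \tag{17.3.26}$$

where $\tilde{\Sigma}_\gamma$ is a consistent estimator of $\Sigma_\gamma$. If $\mathrm{rk}(R) = N$, $R\tilde{\Sigma}_\gamma R'$ is invertible, and the null hypothesis is true, the Wald statistic has an asymptotic $\chi^2(N)$-distribution (see Appendix C.7).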

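In the case of interest here, the restrictions relate to the VAR coefficients and intercept terms only. Therefore, we consider the $s(K^2p + K)$-dimensional vector $\gamma = \mathrm{vec}[B_1, \dots, B_s]$. The restrictions under the null hypothesis $H_5$ can be written as $R\gamma = 0$ with

$$R = \underbrace{\begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & & 0 \\ \vdots & \vdots & & \ddots & \\ 1 & 0 & 0 & \cdots & -1 \end{bmatrix}}_{((s-1)\times s)} \otimes I_{K^2p+K}. \tag{17.3.27}$$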
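As an illustration of (17.3.27), the following Python sketch builds $R$ as a Kronecker product and applies it to a stacked coefficient vector. The helper names (`kron`, `restriction_matrix`, `matvec`) are hypothetical and matrices are stored as plain lists of rows; the point is only that block $i$ of $R\gamma$ equals $\mathrm{vec}(B_1) - \mathrm{vec}(B_{i+1})$, so that $R\gamma = 0$ exactly when all $B_i$ coincide.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def restriction_matrix(s, K, p):
    """R of (17.3.27): ((s-1) x s) contrast matrix times I_{K^2 p + K}."""
    m = K * K * p + K
    # Row i of the contrast matrix compares regime 1 with regime i + 1.
    R0 = [[1] + [-1 if j == i else 0 for j in range(1, s)]
          for i in range(1, s)]
    identity = [[1 if r == c else 0 for c in range(m)] for r in range(m)]
    return kron(R0, identity)


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


if __name__ == "__main__":
    R = restriction_matrix(s=3, K=1, p=1)          # K^2 p + K = 2
    gamma_equal = [0.5, 0.2, 0.5, 0.2, 0.5, 0.2]   # B_1 = B_2 = B_3
    gamma_shift = [0.5, 0.2, 0.5, 0.2, 0.7, 0.2]   # B_3 differs from B_1
    print(matvec(R, gamma_equal))
    print(matvec(R, gamma_shift))
```

In the second call only the element that compares the first entries of $B_1$ and $B_3$ is nonzero.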


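Denoting the unrestricted ML estimator of $\gamma$ by $\tilde{\gamma}$, standard asymptotic theory implies that it has an asymptotic normal distribution with

$$\Sigma_\gamma = \lim T\,\mathcal{I}(\gamma)^{-1},$$

where

$$\mathcal{I}(\gamma) = -E\left[\frac{\partial^2 \ln l}{\partial\gamma\,\partial\gamma'}\right]$$

is the upper left-hand block of the information matrix (17.2.17). For the present case, $\mathcal{I}(\gamma)$ is seen to be block diagonal with the $i$-th $((K^2p + K) \times (K^2p + K))$ block on the diagonal being

$$-E\left[\frac{\partial^2 \ln l}{\partial\gamma_i\,\partial\gamma_i'}\right] = E\Big[\sum_t n_{it} Z_{t-1} Z_{t-1}'\Big] \otimes \Sigma_i^{-1},$$

where $\gamma_i = \mathrm{vec}(B_i)$. Thus $\Sigma_\gamma$ is also block-diagonal and, under standard assumptions, the $i$-th block is consistently estimated by

$$\Big(\sum_t n_{it} Z_{t-1} Z_{t-1}'/T\Big)^{-1} \otimes \tilde{\Sigma}_i^{(1)}.$$

The resulting estimator of $\Sigma_\gamma$ may be used in (17.3.26). If $H_5$ is true, $\lambda_W$ has an asymptotic $\chi^2$-distribution with $(s-1)K(Kp+1)$ degrees of freedom.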


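Testing a Time Invariant Model Against One with Time Varying Error Covariance

In order to test a stationary constant parameter model ($H_2$) against one where the error covariances vary ($H_5$), an LM (Lagrange multiplier) test is convenient because it requires ML estimation under the null hypothesis only. In Appendix C.7, the general form of the LM statistic is given as

$$\lambda_{LM} = s(\tilde{\gamma}_r, \tilde{\sigma}_r)'\,\mathcal{I}(\tilde{\gamma}_r, \tilde{\sigma}_r)^{-1}\,s(\tilde{\gamma}_r, \tilde{\sigma}_r), \tag{17.3.28}$$

where $\mathcal{I}(\tilde{\gamma}_r, \tilde{\sigma}_r)$ is the information matrix of the unrestricted model evaluated at the restricted ML estimators obtained under the null hypothesis and

$$s(\gamma, \sigma) = \begin{bmatrix} \partial \ln l/\partial\gamma \\ \partial \ln l/\partial\sigma \end{bmatrix}$$

is the score vector of first order partial derivatives of the log-likelihood function. In the present case, $\gamma = \mathrm{vec}(B_1)$ is left unrestricted. Thus, $\tilde{\gamma}_r = \tilde{\gamma}$ and

$$\frac{\partial \ln l}{\partial\gamma}\bigg|_{\tilde{\gamma}_r} = 0.$$

Consequently, defining $\sigma = (\mathrm{vech}(\Sigma_1)', \dots, \mathrm{vech}(\Sigma_s)')'$, the LM statistic reduces to

$$\lambda_{LM} = \frac{\partial \ln l}{\partial\sigma'}\left(-E\left[\frac{\partial^2 \ln l}{\partial\sigma\,\partial\sigma'}\right]\right)^{-1}\frac{\partial \ln l}{\partial\sigma} \tag{17.3.29}$$

with all derivatives evaluated at the restricted estimator,

$$\tilde{\sigma}_r = \tilde{\sigma}^{(2)} := \begin{bmatrix} \mathrm{vech}(\tilde{\Sigma}_1^{(2)}) \\ \vdots \\ \mathrm{vech}(\tilde{\Sigma}_1^{(2)}) \end{bmatrix} \quad (\tfrac{1}{2}sK(K+1) \times 1).$$

From (17.3.8), we see that

$$\frac{\partial \ln l}{\partial\,\mathrm{vech}(\Sigma_i)} = -\frac{1}{2}\sum_{t=1}^{T} n_{it}\,D_K'\,\mathrm{vec}(\Sigma_t^{-1} - \Sigma_t^{-1} u_t u_t' \Sigma_t^{-1}), \tag{17.3.30}$$

where $D_K = \partial\,\mathrm{vec}(\Sigma_i)/\partial\,\mathrm{vech}(\Sigma_i)'$ is the $(K^2 \times \tfrac{1}{2}K(K+1))$ duplication matrix, as usual. Furthermore, for the present case,

$$\frac{\partial\,\mathrm{vec}(\Sigma_t)'}{\partial\sigma} = \begin{bmatrix} n_{1t} D_K' \\ \vdots \\ n_{st} D_K' \end{bmatrix} \quad (\tfrac{1}{2}sK(K+1) \times K^2).$$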

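Thus, we get from (17.2.17),

$$-E\left[\frac{\partial^2 \ln l}{\partial\sigma\,\partial\sigma'}\right] = \begin{bmatrix} \tfrac{1}{2}T\bar{n}_1 D_K'(\Sigma_1^{-1} \otimes \Sigma_1^{-1})D_K & & 0 \\ & \ddots & \\ 0 & & \tfrac{1}{2}T\bar{n}_s D_K'(\Sigma_s^{-1} \otimes \Sigma_s^{-1})D_K \end{bmatrix},$$

which implies

$$\left(-E\left[\frac{\partial^2 \ln l}{\partial\sigma\,\partial\sigma'}\right]\right)^{-1} = \begin{bmatrix} 2D_K^{+}(\Sigma_1 \otimes \Sigma_1)D_K^{+\prime}/T\bar{n}_1 & & 0 \\ & \ddots & \\ 0 & & 2D_K^{+}(\Sigma_s \otimes \Sigma_s)D_K^{+\prime}/T\bar{n}_s \end{bmatrix}, \tag{17.3.31}$$

where $D_K^{+}$ is the Moore-Penrose inverse of $D_K$. Using (17.3.30) and (17.3.31) with $u_t$ replaced by $\tilde{u}_t = y_t - \tilde{B}_1^{(2)} Z_{t-1}$ and $\Sigma_t$ replaced by $\tilde{\Sigma}_1^{(2)}$, the LM statistic in (17.3.29) is easy to evaluate. Under the null hypothesis $H_2$ and general conditions, it has an asymptotic $\chi^2$-distribution with $(s-1)K(K+1)/2$ degrees of freedom.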
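To make the evaluation of (17.3.29) concrete, here is a minimal Python sketch for the univariate special case $K = 1$, which is an assumption made purely for brevity. In that case the duplication matrix $D_K$ and its Moore-Penrose inverse are both the scalar 1, the score (17.3.30) for regime $i$ becomes $-\tfrac{1}{2}\sum_t n_{it}(1/\sigma^2 - u_t^2/\sigma^4)$, and the $i$-th diagonal block of (17.3.31) becomes $2\sigma^4/(T\bar{n}_i)$. The function name and the encoding of regimes as integers `0, ..., s-1` are hypothetical.

```python
def lm_statistic_scalar(residuals, regimes, s):
    """LM statistic (17.3.29) for K = 1.

    residuals: residuals of the constant-parameter model (H2)
    regimes:   regime index in 0..s-1 for each period t (n_it = 1 iff regimes[t] == i)
    """
    T = len(residuals)
    # Restricted ML estimator of the (common) error variance under H2.
    sigma2 = sum(u * u for u in residuals) / T
    lm = 0.0
    for i in range(s):
        u_i = [u for u, r in zip(residuals, regimes) if r == i]
        n_i = len(u_i)  # equals T * n_bar_i
        # Score (17.3.30) with D_K = 1, evaluated at the restricted estimator.
        score = -0.5 * sum(1.0 / sigma2 - u * u / sigma2 ** 2 for u in u_i)
        # i-th diagonal block of (17.3.31) with D_K^+ = 1.
        inv_info = 2.0 * sigma2 ** 2 / n_i
        lm += score * inv_info * score
    return lm


if __name__ == "__main__":
    # Two regimes with clearly different residual variances.
    print(lm_statistic_scalar([1.0, -1.0, 3.0, -3.0], [0, 0, 1, 1], s=2))
```

With $K = 1$ and $s = 2$ the statistic would be compared with a $\chi^2$-distribution with $(s-1)K(K+1)/2 = 1$ degree of freedom. The statistic is zero when the residual variance is the same in every regime.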
17.3.3 An Example

The previously considered theoretical concepts shall now be illustrated by an example from Lütkepohl (1992). We use the first differences of logarithms of quarterly, seasonally unadjusted West German income and consumption data for the years 1960-1987 given in File E4. The two series are plotted in Figure 17.1. Obviously, they exhibit a quite strong seasonal pattern.

There  are  various  problems  that  may  be  brought  up  with  respect  to  the 
data.  For  instance,  it  is  possible  that  the  logarithms  of  the  original  series 
are  cointegrated  (see  Part  II).  In  that  case,  fitting  a  VAR  process  to  the 
first  differences  may  be  inappropriate.  Also,  there  may  be  structural  shifts 
during  the  sample  period.  We  ignore  such  problems  here  because  we  just  want 
to  provide  an  illustrative  example  for  the  theoretical  results  of  the  previous 
subsections. 

Because we have quarterly data, the period $s = 4$ is given naturally. Stacking the variables for each year in one long 8-dimensional vector $\mathfrak{y}_\tau$, as in Section 17.3.1, we just have 27 observations for each component of $\mathfrak{y}_\tau$. (Note that the first value of the series is lost by differencing.) Thus, the largest full VAR process that can be fitted to the 8-dimensional system is a VAR(3). In such a situation, application of model selection criteria is a doubtful strategy for choosing the order of $\mathfrak{y}_\tau$. Because we want to test the null hypothesis of constant coefficients, it may be reasonable to choose the VAR order under the null hypothesis, that is, to assume a constant coefficient model at the VAR order selection stage. Therefore we have fitted constant coefficient VAR models to the bivariate $y_t$ series consisting of the quarterly income and consumption variables. FPE, AIC, HQ, and SC all have chosen the order $p = 5$ when a maximum of 8 was allowed. Of course, this may not mean too much if the coefficients are actually time varying. The order 5 seems to be a reasonable choice, however, because it means that, for each observation, lags from a whole year and the corresponding quarter of the previous year are included. Therefore, we will work with $p = 5$ in the following.

The first test we carry out is one of $H_2$ against $H_1$, that is, a constant coefficient model is tested against one where all the coefficients are time varying. Note that we use the order $p = 5$ also for the model with time varying coefficients. The test value $\lambda_{LR} = 2(\lambda_1 - \lambda_2) = 223.79$ is clearly significant at the 1% level because in this case the number of degrees of freedom of the asymptotic $\chi^2$-distribution is 75. Thus, we conclude that at least some coefficients are not time invariant. To see whether the noise series may be regarded as stationary, we also test $H_3$ against $H_1$. The resulting test value is $\lambda_{LR} = 2(\lambda_1 - \lambda_3) = 35.95$ which is also significant at the 1% level because we now have 9 degrees of freedom. Next we use the Wald test described in Section 17.3.2 to see whether the VAR coefficients and intercept terms may be assumed to be constant through time. In other words, we test $H_5$ against $H_1$. The test value becomes $\lambda_W = 347$. Comparing this with critical values from the $\chi^2(66)$-distribution, we again reject the null hypothesis $H_5$ at the 1% level. The reader is invited to perform further tests on these data. The tests performed so far support a full periodic model. Notice, however, that our tests are based on asymptotic theory. Their actual distributions in samples as small as the present one are unclear and, in any case, they are not likely to be well approximated by the asymptotic $\chi^2$-distributions. Thus, it is not clear how much evidence in favor of a full periodic model the tests actually provide in this specific case.

Of course, it is possible that a periodic model does not adequately capture the characteristics of the data generating process. In that case, the tests may not have much relevance. To check the adequacy of a periodic model, similar tools may be used as in the stationary nonperiodic case. For instance, a residual analysis could be performed in a similar fashion as for nonperiodic VAR models. The properties of tests for model adequacy may be derived from the stationary representation of the annual process $\mathfrak{y}_\tau$.

17.3.4  Bibliographical  Notes  and  Extensions 

Early discussions of periodic time series models include those by Gladyshev (1961) and Jones & Brelsford (1967). Pagano (1978) studied properties of periodic autoregressions while Cleveland & Tiao (1979) considered periodic univariate ARMA models and Tiao & Grupe (1980) explored the consequences of fitting nonperiodic models to data generated by a periodic model. Cipra (1985) discussed inference for periodic moving average processes and Li & Hui (1988) developed an algorithm for ML estimation of periodic ARMA models. A Bayesian analysis of periodic autoregressions was given by Anděl (1983, 1987) and an application of periodic modelling can be found, for instance, in Osborn & Smith (1989). More recently, periodic models for integrated and cointegrated variables were also considered (e.g., Herwartz (1995), Boswijk & Franses (1995, 1996), Boswijk, Franses & Haldrup (1997), Ghysels & Osborn (2001, Chapter 6)). The last publication also includes many more references related to periodic time series models.


17.4  Intervention  Models 

In Section 17.1, an intervention model was described as one where a particular stationary data generation mechanism is in operation until period $T_1$, say, and another process generates the data after period $T_1$. For instance,

$$y_t = \nu_1 + A_1 Y_{t-1} + u_t, \quad E(u_t u_t') = \Sigma_1, \quad t \le T_1 \tag{17.4.1}$$

and

$$y_t = \nu_2 + A_2 Y_{t-1} + u_t, \quad E(u_t u_t') = \Sigma_2, \quad t > T_1. \tag{17.4.2}$$

In the present case, it makes a difference whether the intervention is modelled within the intercept form of the process like in (17.4.1)/(17.4.2) or within a mean-adjusted representation. We will consider both cases in turn, following Lütkepohl (1992).


17.4.1  Interventions  in  the  Intercept  Model 


Before we consider more general situations, it may be useful to study the case described by (17.4.1) and (17.4.2) in a little more detail. For simplicity, suppose that $A_2 = A_1$ and $\Sigma_2 = \Sigma_1$ so that there is just a shift in the intercept terms. Moreover, we assume that the process is stable. In this case, the mean of $y_t$ is

$$E(y_t) = \begin{cases} \displaystyle\sum_{i=0}^{\infty} \Phi_i \nu_1, & t \le T_1, \\[2ex] \displaystyle\sum_{i=0}^{t-T_1-1} \Phi_i \nu_2 + \sum_{i=t-T_1}^{\infty} \Phi_i \nu_1, & t > T_1, \end{cases}$$

where the $\Phi_i$'s are the coefficient matrices of the moving average representation of the mean-adjusted process, i.e.,

$$\sum_{i=0}^{\infty} \Phi_i z^i = (I_K - A_{11}z - \cdots - A_{p1}z^p)^{-1}.$$

Hence, after the intervention, the process mean does not reach a fixed new level immediately but only gradually,

$$E(y_t) \underset{t\to\infty}{\longrightarrow} \sum_{i=0}^{\infty} \Phi_i \nu_2.$$
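The gradual adjustment of the mean can be illustrated with a small Python sketch. It assumes, purely for simplicity, a univariate AR(1) process ($K = 1$, $p = 1$) with coefficient $a$, $|a| < 1$, so that $\Phi_i = a^i$. The function name and parameter values are hypothetical. The mean then obeys the recursion $E(y_t) = \nu_2 + a\,E(y_{t-1})$ for $t > T_1$, started from the pre-intervention level $\nu_1/(1-a)$.

```python
def mean_path(nu1, nu2, a, horizon):
    """E(y_t) for t = T_1, T_1 + 1, ..., T_1 + horizon in a stable AR(1)."""
    if abs(a) >= 1.0:
        raise ValueError("the AR(1) process must be stable, |a| < 1")
    mean = nu1 / (1.0 - a)      # mean before the intervention
    path = [mean]
    for _ in range(horizon):
        mean = nu2 + a * mean   # intercept nu2 applies after period T_1
        path.append(mean)
    return path


if __name__ == "__main__":
    path = mean_path(nu1=1.0, nu2=2.0, a=0.5, horizon=30)
    print(path[:4])    # first steps of the adjustment
    print(path[-1])    # close to the new level nu2 / (1 - a)
```

The first step after the intervention already moves the mean, but the new level $\nu_2/(1-a)$ is approached only geometrically, in line with the limit above.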

In  the  more  general  situation,  where  all  coefficients  change  due  to  the 
intervention,  similar  results  also  hold  for  the  autocovariance  structure.  Of 
course,  such  a  behavior  may  be  quite  plausible  in  practice  because  a  system 
may  react  slowly  to  an  intervention.  On  the  other  hand,  it  is  also  conceivable 
that  an  abrupt  change  occurs.  For  the  case  of  a  change  in  the  mean,  we  will 
discuss  this  situation  in  Section  17.4.2. 

Before discussing that case, we note that the model setup considered in Section 17.3 may be used for intervention models as well with properly specified $n_{it}$, as mentioned in Section 17.1. The hypotheses considered in Section 17.3.2 are also of interest in the present context. The test statistics may be computed with the same formulas as in Section 17.3.2 and the tests are often referred to as Chow tests (see also Chapter 4, Section 4.6.1). However, the test statistics do not necessarily have the indicated asymptotic distributions in the present case. The problem is that the ML estimators given in the previous section may not be consistent anymore. To see this, consider, for instance, the hypothesis $H_1$ (all coefficients time varying) and the model in (17.4.1)/(17.4.2). If $T_1$ is some fixed finite point and $T > T_1$,

$$\tilde{B}_1 = [\tilde{\nu}_1, \tilde{A}_1] = \Big(\sum_{t=1}^{T_1} y_t Z_{t-1}'\Big)\Big(\sum_{t=1}^{T_1} Z_{t-1} Z_{t-1}'\Big)^{-1}$$

will not be consistent because the sample information regarding $B_1 := [\nu_1, A_1]$ does not increase when $T$ goes to infinity. As a way out of this problem it may be assumed that $T_1$ increases with $T$. For instance, $T_1$ may be a fixed proportion of $T$. Then, under common assumptions,

$$\mathrm{plim}\,\tilde{B}_i = B_i, \quad i = 1, 2.$$

Also asymptotic normality is easy to obtain in this case and the test statistics have the limiting $\chi^2$-distributions obtained from the results in Section 17.3.2.

A logical problem may arise if more than one intervention is present. In that case, it may not be easy to justify the assumption that all subperiods approach infinity with the sample period $T$. Whether or not this is a problem of practical relevance must be decided on the basis of the as yet unknown small sample properties of the tests. In any event, the large sample $\chi^2$-distributions are just meant to be a guide for the small sample performance of the tests and as such they may be used if the periods between the interventions are reasonably large. Unfortunately, as mentioned in Chapter 4, Section 4.6.1, the asymptotic $\chi^2$-distributions of the test statistics are not likely to be good approximations to the actual small sample distributions if systems of variables are considered.

17.4.2  A  Discrete  Change  in  the  Mean 

We have seen that in an intercept model like (17.4.1)/(17.4.2) the mean gradually approaches a new level after the intervention. Occasionally, it may be more plausible to assume that there is a one-time jump in the process mean after time $T_1$. In such a situation, a model in mean-adjusted form,

$$y_t - \mu_t = A_1(y_{t-1} - \mu_{t-1}) + \cdots + A_p(y_{t-p} - \mu_{t-p}) + u_t, \tag{17.4.3}$$

is easier to work with. Here $\mu_t := E(y_t)$ and, for simplicity, it is assumed that all other coefficients are time invariant and that the process is stable. Therefore, the second subscript is dropped from the VAR coefficient matrices. We also assume Gaussian white noise $u_t$ with time invariant covariance, $u_t \sim \mathcal{N}(0, \Sigma_u)$. Suppose

$$\mu_t = n_{1t}\mu_1 + \cdots + n_{st}\mu_s, \quad n_{it} = 0 \text{ or } 1, \quad \sum_{i=1}^{s} n_{it} = 1. \tag{17.4.4}$$

In other words, there are $s$ interventions so that for each $i$, the $n_{it}$'s, $t = 1, \dots, T$, are a sequence of zeros and ones, the latter appearing in consecutive positions.

In general, exact ML estimation of the model (17.4.3) results in nonlinear normal equations. To avoid the use of nonlinear optimization algorithms, the $\mu_i$'s may be estimated by

$$\tilde{\mu}_i = \frac{1}{T\bar{n}_i}\sum_{t=1}^{T} n_{it} y_t, \quad i = 1, \dots, s. \tag{17.4.5}$$
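The estimator (17.4.5) is simply the sample mean of the observations falling into regime $i$. A minimal Python sketch, with a hypothetical function name and with the dummies stored as a list of $T$ indicator vectors $(n_{1t}, \dots, n_{st})$, might look as follows.

```python
def regime_means(y, n):
    """Estimate mu_1, ..., mu_s as in (17.4.5).

    y: list of T observation vectors, each of length K
    n: list of T indicator vectors, each of length s, with exactly one 1
    """
    s = len(n[0])
    K = len(y[0])
    means = []
    for i in range(s):
        count = sum(n_t[i] for n_t in n)          # T * n_bar_i
        totals = [sum(n_t[i] * y_t[k] for y_t, n_t in zip(y, n))
                  for k in range(K)]
        means.append([total / count for total in totals])
    return means


if __name__ == "__main__":
    y = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
    n = [[1, 0], [1, 0], [0, 1], [0, 1]]   # one intervention after period 2
    print(regime_means(y, n))
```

Each regime must contain at least one observation, otherwise the division by $T\bar{n}_i$ is undefined.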


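Provided $T\bar{n}_i = \sum_t n_{it}$ approaches infinity with $T$, it can be shown that under general assumptions, $\tilde{\mu}_i$ is consistent and

$$\sqrt{T\bar{n}_i}\,(\tilde{\mu}_i - \mu_i) \xrightarrow{d} \mathcal{N}(0, \Sigma_\mu), \tag{17.4.6}$$

where

$$\Sigma_\mu = (I_K - A_1 - \cdots - A_p)^{-1}\,\Sigma_u\,(I_K - A_1 - \cdots - A_p)'^{-1}$$

(see Chapter 3, Section 3.3, and Problem 17.6). Note that the asymptotic covariance matrix does not depend on $i$. Furthermore, the $\tilde{\mu}_i$ are asymptotically independent. Hence, it is quite easy to perform a Wald test of the hypothesis

$$H_6: \mu_i = \mu_1,\ i = 2, \dots, s \quad\text{or}\quad R\begin{bmatrix} \mu_1 \\ \vdots \\ \mu_s \end{bmatrix} = 0, \tag{17.4.7}$$

where $R$ has a similar structure as in (17.3.27). The corresponding Wald statistic is

$$\lambda_W = T\big[\sqrt{\bar{n}_1}\,\tilde{\mu}_1', \dots, \sqrt{\bar{n}_s}\,\tilde{\mu}_s'\big]\,R'\big[R(I_s \otimes \tilde{\Sigma}_\mu)R'\big]^{-1}R\begin{bmatrix} \sqrt{\bar{n}_1}\,\tilde{\mu}_1 \\ \vdots \\ \sqrt{\bar{n}_s}\,\tilde{\mu}_s \end{bmatrix}, \tag{17.4.8}$$

where $[R(I_s \otimes \tilde{\Sigma}_\mu)R']^{-1}$ reduces to

$$\begin{bmatrix} 2 & 1 & \cdots & 1 \\ 1 & 2 & & 1 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 2 \end{bmatrix}^{-1} \otimes \tilde{\Sigma}_\mu^{-1}$$

and $\Sigma_\mu$ is estimated in the usual way. In other words,

$$\tilde{A} := [\tilde{A}_1, \dots, \tilde{A}_p] = \Big(\sum_t \tilde{y}_t \tilde{Y}_{t-1}'\Big)\Big(\sum_t \tilde{Y}_{t-1} \tilde{Y}_{t-1}'\Big)^{-1}$$

and

$$\tilde{\Sigma}_u = \sum_t (\tilde{y}_t - \tilde{A}\tilde{Y}_{t-1})(\tilde{y}_t - \tilde{A}\tilde{Y}_{t-1})'/T,$$

where $\tilde{y}_t := y_t - \tilde{\mu}_t$ and

$$\tilde{Y}_{t-1} := \begin{bmatrix} y_{t-1} - \tilde{\mu}_{t-1} \\ \vdots \\ y_{t-p} - \tilde{\mu}_{t-p} \end{bmatrix}.$$

Under $H_6$, $\lambda_W$ has an asymptotic $\chi^2$-distribution with $(s-1)K$ degrees of freedom, if the VAR process is stable.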


17.4.3  An  Illustrative  Example 

As an example of testing for structural change in the present framework, we consider again the seasonally adjusted quarterly West German investment, income, and consumption data given in File E1. The data were first used in Chapter 3. As in that and some other chapters, we perform the analysis for the first differences of logarithms (rates of change) of the data. In Chapter 4, tests for a structural break after the year 1978 when the second oil price crisis occurred were already performed. We will now consider different pairs of hypotheses to illustrate the results of this section.

For an event like a drastic oil price increase, a smooth adjustment of the general economic conditions seems more plausible than a discrete change. Therefore the intercept version of an intervention model is chosen with $n_{1t} = 1$, $n_{2t} = 0$ for $t \le 1978.4$ and $n_{1t} = 0$, $n_{2t} = 1$ for $t \ge 1979.1$. Because a VAR(2) model performed reasonably well in Chapter 4 for the period 1960-1978 we use VAR(2) processes for both subperiods. This choice is plausible under the null hypothesis of no structural change after 1978.

We first test a stationary model ($H_2$) against one where all parameters are allowed to vary ($H_1$). The resulting value of the LR statistic is $\lambda_{LR} = 64.11$. From Table 17.1 we have

$$(s-1)K[K(p+\tfrac{1}{2})+\tfrac{3}{2}] = 27$$

degrees of freedom because $s = 2$, $K = 3$, and $p = 2$. Hence, we can reject the null hypothesis of time invariance at the 1% level of significance ($\chi^2(27)_{.99} = 46.96$). This result, of course, does not necessarily mean that all coefficients are really time varying. For instance, the error covariance matrix may be time invariant while the other coefficients vary. To check this possibility, we test $H_3$ against $H_1$ (all coefficients time varying). The value of the LR statistic becomes $\lambda_{LR} = 33.46$ and the number of degrees of freedom for this test is 6. Thus, the test value exceeds the critical value of the $\chi^2$-distribution for a 1% significance level ($\chi^2(6)_{.99} = 16.81$) and we reject the null hypothesis. Further tests on the data are possible and the reader is invited to carry them out.

It is perhaps worth pointing out that we have only four years of data or 16 observations for each variable after the potential structural change. The quality of the $\chi^2$-approximations to the distributions of the LR statistics is therefore doubtful in the present case, as discussed in Chapter 4, Section 4.6.1. In that section, we found that a bootstrap version of a Chow test of $H_2$ against $H_3$ may result in a different conclusion than the use of asymptotic critical values. Clearly, similar results are conceivable for the example considered in this section.
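The asymptotic comparisons quoted in this example can be reproduced without statistical tables when the number of degrees of freedom is even. The following Python sketch uses the closed form $P(\chi^2(2m) > x) = e^{-x/2}\sum_{k=0}^{m-1}(x/2)^k/k!$, which holds only for even degrees of freedom. The function name is hypothetical, and the sketch is not a general-purpose $\chi^2$ routine (it does not cover, for instance, the 27 degrees of freedom of the first test).

```python
import math


def chi2_sf_even(x, df):
    """P(X > x) for X ~ chi-square with an even number df of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0     # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Critical value quoted in the text for 6 degrees of freedom at the 1% level
    print(chi2_sf_even(16.81, 6))
    # Asymptotic p-value of the LR statistic 33.46 with 6 degrees of freedom
    print(chi2_sf_even(33.46, 6))
```

The first call returns a tail probability very close to 0.01, consistent with the quoted critical value. These are asymptotic p-values only and inherit the small sample caveats discussed above.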


17.4.4  Extensions  and  References 


Although we have used the label "intervention" for the type of change that occurs in the models considered in the previous subsections, they could also be regarded as outliers if, for instance, a change in the process mean occurs for a small number of periods only. Tsay (1988) discussed univariate time series models with outliers and structural changes and listed a number of further references. By appropriate choice of the dummy variables $n_{it}$, it is possible to combine periodic and intervention or outlier models. Extensions of the present framework to VARMA or restricted VAR models are possible in principle. Moreover, cointegrated VAR models with structural shifts were already mentioned in Chapter 8 in the context of testing for the cointegration rank.

More general forms of interventions in the process mean were discussed by Box & Tiao (1975) and Abraham (1980). They assumed that interventions have occurred at $t = T_1, \dots, T_k$ and they define a vector $I_t = (I_t(T_1), \dots, I_t(T_k))'$ of dummy variables that may be of the type

$$I_t(T_i) = \begin{cases} 0 & \text{for } t < T_i, \\ 1 & \text{for } t \ge T_i, \end{cases}$$

or of the type

$$I_t(T_i) = \begin{cases} 0 & \text{for } t \ne T_i, \\ 1 & \text{for } t = T_i. \end{cases}$$

They model the interventions as $R(L)I_t$, where $R(L)$ is a matrix of rational functions in the lag operator.

Further complications arise if the time of the break is unknown and has to be estimated in addition to the VAR coefficients. This case was discussed by Bai (1994), Bai, Lumsdaine & Stock (1998), and Lütkepohl et al. (2004), among others.


17.5  Exercises 

Problem  17.1 

Suppose $y_t$ is a periodic $K$-dimensional VAR(1) (PAR(1)) process,

$$y_t = \nu_1 + A_{11}y_{t-1} + u_t, \quad E(u_t u_t') = \Sigma_1, \quad \text{if } t \text{ is even},$$

and

$$y_t = \nu_2 + A_{12}y_{t-1} + u_t, \quad E(u_t u_t') = \Sigma_2, \quad \text{if } t \text{ is odd}.$$

(a)  Derive  explicit  expressions  for  the  means  /it  and  the  matrices 

(b)  Derive  the  autocovariances  E[(j/t  —  fit)(yt-h  —  Ht-h)']  for  h  =  1,2,3,  for 
both  cases,  t  even  and  t  odd.  Write  down  explicitly  the  assumptions  used 
in  deriving  the  autocovariance  matrices. 

Problem  17.2 

Assume  that  the  process  yt  given  in  Problem  17.1  is  bivariate  with 


and 

A\2  = 


.6  .4  ' 

.8  .5  ' 


Is the corresponding process $\mathfrak{y}_\tau = (y_{2\tau}', y_{2\tau-1}')'$ stable?

Problem  17.3 

Give the forecasts $y_t(h)$, $h = 1, 2, 3$, $t$ odd, for the process from Problem 17.1 and derive explicit expressions for the forecast MSE matrices.

Problem 17.4

For the process given in Problem 17.1, construct an LM test of the hypotheses

$$H_0: \nu_1 = \nu_2,\ A_{11} = A_{12},\ \Sigma_1 = \Sigma_2$$

against

$$H_1: \nu_1 = \nu_2,\ A_{11} = A_{12},\ \Sigma_1 \ne \Sigma_2.$$

Provide an explicit expression for the LM statistic.

Problem 17.5

Suppose the process from Problem 17.1 is in operation until period $T_1$ and after that another periodic VAR(1) process of the same type but with different coefficients generates a set of variables. Define dummy variables in such a way that the complete process can be written in the form (17.1.3)-(17.1.5).

Problem  17.6 

Show  that  (17.4.6)  holds.  (Hint:  See  Chapter  3,  Section  3.3.) 

Problem  17.7 

In  1974  the  Deutsche  Bundesbank  officially  changed  its  monetary  policy  and 
started  targeting  the  money  stock.  Use  the  two  interest  rate  series  given  in 
File  E5  and  test  for  an  intervention  after  1974  within  the  framework  discussed 
in  Section  17.4.  Use  tests  for  different  types  of  interventions  related  to  shifts 
in  the  intercept  terms,  the  VAR  coefficients,  and  the  white  noise  covariances. 
State Space Models


18.1  Background 

State space models may be regarded as generalizations of the models considered so far. They have been used extensively in system theory, the physical sciences, and engineering. The terminology is therefore largely from these fields. The general idea behind these models is that an observed (multiple) time series $y_1, \dots, y_T$ depends upon a possibly unobserved state $z_t$ which is driven by a stochastic process. The relation between $y_t$ and $z_t$ is described by the observation or measurement equation

$$y_t = H_t z_t + v_t, \tag{18.1.1}$$

where $H_t$ is a matrix that may also depend on the period of time, $t$, and $v_t$ is the observation error which is typically assumed to be a noise process. The state vector or state of nature is generated as

$$z_t = B_{t-1}z_{t-1} + w_{t-1}, \tag{18.1.2}$$

which is often called the transition equation because it describes the transition of the state of nature from period $t-1$ to period $t$. The matrix $B_t$ is a coefficient matrix that may depend on $t$ and $w_t$ is an error process. The system (18.1.1)/(18.1.2) is one form of a state space model.

The following example from Meinhold & Singpurwalla (1983) may illustrate the related concepts. Suppose we wish to trace a satellite's orbit. The state vector $z_t$ may then consist of the position and the speed of the satellite in period $t$ with respect to the center of the earth. The state cannot be measured directly but, for example, the distance from a certain observatory may be measured. These measurements constitute the observed vectors $y_t$. As another example, consider the income of an individual which may depend on unobserved factors such as intelligence, special abilities, special interests and so on. In this case, the state vector consists of the variables that describe the abilities of the person and $y_t$ is his or her observed income.

The reader may recall that all the models considered so far have been written in a form similar to (18.1.1)/(18.1.2) at some stage. For instance, our standard (zero mean) VAR($p$) model can be written in VAR(1) form as

$$Y_t = \mathbf{A}Y_{t-1} + U_t, \tag{18.1.3}$$

where

$$Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \quad \mathbf{A} := \begin{bmatrix} A_1 & \cdots & A_{p-1} & A_p \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & & I_K & 0 \end{bmatrix}, \quad U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

Defining $z_t := Y_t$, $B_t := \mathbf{A}$, and $w_{t-1} := U_t$, Equation (18.1.3) may be viewed as the transition equation of a state space model. The corresponding measurement equation is

$$y_t = [I_K : 0 : \cdots : 0]\,Y_t$$

with $H_t := [I_K : 0 : \cdots : 0]$ and $v_t := 0$.
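The VAR(1) (companion) representation and its measurement equation can be checked numerically with a short Python sketch. The helper names are hypothetical, matrices are stored as lists of rows, and the coefficient values in the usage example are arbitrary illustrative numbers, not estimates from any data set.

```python
def companion(A_list):
    """Stack the (K x K) matrices A_1, ..., A_p into the (Kp x Kp) companion matrix."""
    p = len(A_list)
    K = len(A_list[0])
    # First block row: [A_1, ..., A_p]
    top = [sum((A_i[r] for A_i in A_list), []) for r in range(K)]
    # Remaining block rows: [I_{K(p-1)} : 0]
    shift = [[1.0 if c == r else 0.0 for c in range(K * p)]
             for r in range(K * (p - 1))]
    return top + shift


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transition(A_list, Y_prev, u_t):
    """One step of Y_t = A Y_{t-1} + U_t with U_t = (u_t', 0, ..., 0)'."""
    Y = matvec(companion(A_list), Y_prev)
    for k, u in enumerate(u_t):
        Y[k] += u
    return Y


def measurement(Y, K):
    """y_t = [I_K : 0 : ... : 0] Y_t, i.e. the first K elements of the state."""
    return Y[:K]


if __name__ == "__main__":
    A1 = [[0.5, 0.1], [0.0, 0.3]]
    A2 = [[0.2, 0.0], [0.1, 0.1]]
    Y_prev = [1.0, 2.0, 3.0, 4.0]      # (y_{t-1}', y_{t-2}')'
    Y_t = transition([A1, A2], Y_prev, u_t=[0.1, -0.1])
    print(measurement(Y_t, K=2))       # equals A1 y_{t-1} + A2 y_{t-2} + u_t
    print(Y_t[2:])                     # y_{t-1} carried forward in the state
```

The state simply carries the last $p$ observations, so the measurement equation has no error term in this representation.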

In the next section, we will introduce a slightly more general version of a state space model, we will review many of the previous models, and we will cast them into state space form. As we have seen, the representations of the models used in the previous chapters are useful for many purposes. There are occasions, however, where a state space representation makes life easier. We have actually used state space representations of some models without explicitly mentioning this fact. We will also consider some further models that have been discussed in the literature and which may be set up as special cases of state space models. Thereby we will give an overview of a number of important models that have been considered in the multiple time series literature.

In Section 18.3, we will discuss the Kalman filter which is an extremely useful tool in the analysis of state space models. Given the observable vectors $y_t$, it provides estimates of the state vectors and measures of the precision of these estimates. In a situation where the state vector consists of unobservable variables, such estimates may be of interest. In a system such as (18.1.1)/(18.1.2), the matrices $B_t$ and $H_t$ and the covariance matrices of $v_t$ and $w_t$ will often depend on unknown parameters. The Kalman filter is also helpful in estimating these parameters. This issue will be discussed in Section 18.4.

In  this  chapter,  we  will  just  give  a  brief  introduction  to  some  basic  con¬ 
cepts  related  to  state  space  models  and  the  Kalman  filter.  Various  textbooks 
exist  that  provide  broader  introductions  to  the  topic  and  a  more  in-depth  dis¬ 
cussion.  Examples  are  Jazwinski  (1970),  Anderson  &  Moore  (1979),  Hannan 
&  Deistler  (1988),  Aoki  (1987),  and  Harvey  (1989). 
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18.2.1  The  Model  Setup 
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As  mentioned  in  the  previous  section,  a  state  space  model  consists  of  a  tran¬ 
sition  or  system  equation 


Zt+i  —  +  FtXt  +  u>t,  t  —  0, 1, 2, ... , 

or,  equivalently, 

Zt  =  +  Wt-i,  t=  1,2,...,  (18.2.1) 

and  a  measurement  or  observation  equation 

yt  =  Utzt  +  Gt.Xt  +  Vt,  t  =  1,2,....  (18.2.2) 

Here 

$y_t$ is a $(K \times 1)$ vector of observable output or endogenous variables,

$z_t$ is an $(N \times 1)$ state vector or the state of nature,

$x_t$ is an $(M \times 1)$ vector of observable inputs or instruments or policy
variables,

$v_t$ is a $(K \times 1)$ vector of observation or measurement errors or noise,

$w_t$ is an $(N \times 1)$ vector of system or transition equation errors or noise,

$H_t$ is a $(K \times N)$ measurement matrix,

$G_t$ is a $(K \times M)$ input matrix of the observation equation,

$B_t$ is an $(N \times N)$ transition or system matrix,

and

$F_t$ is an $(N \times M)$ input matrix of the transition equation.

The matrices $H_t$, $G_t$, $B_t$, and $F_t$ are assumed to be known at time $t$. Although
they are in general allowed to vary, at least some of them will often be time
invariant. In practice, at least some of the elements of these matrices are
usually unknown and have to be estimated. This issue is deferred to Section
18.4. It is perhaps noteworthy that the process generating the $z_t$'s and, hence,
also the $y_t$'s is assumed to be started from an initial state $z_0$ and a given
initial input $x_0$.

To  complete  the  description  of  the  model,  we  make  the  following  stochastic 
assumptions  for  the  noise  processes  and  the  initial  state: 

The joint process

$$\begin{bmatrix} w_t \\ v_t \end{bmatrix}$$

is a zero mean, serially uncorrelated noise process with possibly time varying
covariance matrices

$$\begin{bmatrix} \Sigma_{w_t} & \Sigma_{w_t v_t} \\ \Sigma_{v_t w_t} & \Sigma_{v_t} \end{bmatrix}.$$

The initial state $z_0$ is uncorrelated with $w_t$, $v_t$ for all $t$ and has a distribution
with mean $\mu_{z_0}$ and covariance matrix $\Sigma_{z_0}$. The input sequence $x_0, x_1, \dots$ is
assumed to be nonstochastic for simplicity. If the observed inputs are actually
stochastic, the analysis is assumed to be conditional on a given sequence of
inputs.

With these assumptions we can derive stochastic properties of the states
and the system outputs. Successive substitution in (18.2.1) implies

$$z_t = \Phi_{t,t} z_0 + \sum_{i=1}^{t} \Phi_{i-1,t} (F_{t-i} x_{t-i} + w_{t-i}), \tag{18.2.3}$$

where

$$\Phi_{0,t} := I_N \quad \text{and} \quad \Phi_{i,t} := \prod_{j=1}^{i} B_{t-j}, \quad i = 1, 2, \dots$$

(see also Section 17.2.1). Hence,

$$\mu_{z_t} := E(z_t) = \Phi_{t,t} \mu_{z_0} + \sum_{i=1}^{t} \Phi_{i-1,t} F_{t-i} x_{t-i} \tag{18.2.4}$$

and

$$\begin{aligned}
\text{Cov}(z_t, z_{t+h}) &= E[(z_t - \mu_{z_t})(z_{t+h} - \mu_{z_{t+h}})'] \\
&= \Phi_{t,t} \Sigma_{z_0} \Phi_{t+h,t+h}' + \sum_{i=1}^{t} \Phi_{i-1,t} \Sigma_{w_{t-i}} \Phi_{h+i-1,t+h}'.
\end{aligned} \tag{18.2.5}$$

Under the aforementioned stochastic assumptions, it is also easy to derive the
means and covariance matrices of the output process:

$$\mu_{y_t} := E(y_t) = H_t E(z_t) + G_t x_t$$

and

$$\text{Cov}(y_t, y_{t+h}) = H_t \text{Cov}(z_t, z_{t+h}) H_{t+h}' \quad \text{for } h \neq 0.$$

Generally, the means and autocovariances of the $y_t$'s are obviously not time
invariant. Thus, in general, $y_t$ is a nonstationary process.

We will now consider various special cases of state space models which
are obtained by specific definitions of the state vector, the inputs, the noise
processes, and the matrices $H_t$, $G_t$, $B_t$, and $F_t$. These matrices and the noise
covariance matrices will often not depend on $t$, in which case we will suppress
the subscript for notational simplicity.

A Finite Order VAR Process

Although we have mentioned earlier how to cast a VAR($p$) process,

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t, \tag{18.2.6}$$

in  state  space  form,  it  may  be  useful  to  consider  this  model  again  because  it 
illustrates  that  often  different  state  space  models  can  represent  a  particular 
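process. One possible state space representation is obtained by defining

$$Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \quad
\boldsymbol{\nu} := \begin{bmatrix} \nu \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
\mathbf{A} := \begin{bmatrix} A_1 & \dots & A_{p-1} & A_p \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I_K & 0 \end{bmatrix}, \quad
U_t := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{18.2.7}$$

Hence,

$$Y_t = \mathbf{A} Y_{t-1} + \boldsymbol{\nu} + U_t, \tag{18.2.8}$$

$$y_t = [I_K : 0 : \cdots : 0] Y_t \tag{18.2.9}$$

is a state space model with state vector $z_t := Y_t$, $B := \mathbf{A}$, $F := \boldsymbol{\nu}$, $x_t := 1$,
$w_t := U_{t+1}$, $H := [I_K : 0 : \cdots : 0]$, $G := 0$, $v_t := 0$.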
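
As a numerical illustration of (18.2.7)-(18.2.9), the following Python sketch builds the stacked matrices for a small VAR(2) and checks that the state space recursion reproduces the original VAR recursion (18.2.6). The function name `companion_form`, the use of NumPy, and all coefficient values are assumptions made for this illustration only; they are not part of the text.

```python
import numpy as np

def companion_form(nu, A_list):
    """Stack a VAR(p) into the state space matrices B (= A), F (= nu) and H."""
    K = len(nu)
    p = len(A_list)
    B = np.zeros((K * p, K * p))
    B[:K, :] = np.hstack(A_list)           # first block row: [A_1 : ... : A_p]
    B[K:, :-K] = np.eye(K * (p - 1))       # identity blocks shift the lags down
    F = np.zeros(K * p)
    F[:K] = nu                             # stacked intercept [nu', 0, ..., 0]'
    H = np.hstack([np.eye(K), np.zeros((K, K * (p - 1)))])
    return B, F, H

# Hypothetical bivariate VAR(2); the numbers are illustrative only.
nu = np.array([0.1, -0.2])
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
A2 = np.array([[0.2, 0.0], [0.1, 0.1]])
B, F, H = companion_form(nu, [A1, A2])

rng = np.random.default_rng(0)
T = 20
u = rng.standard_normal((T + 1, 2))
y_prev, y_prev2 = np.zeros(2), np.zeros(2)   # y_0 and y_{-1}
z = np.concatenate([y_prev, y_prev2])        # z_0 = Y_0
max_diff = 0.0
for t in range(1, T + 1):
    y_t = nu + A1 @ y_prev + A2 @ y_prev2 + u[t]   # VAR recursion (18.2.6)
    w = np.concatenate([u[t], np.zeros(2)])        # w_{t-1} = U_t
    z = B @ z + F + w                              # transition with x_t = 1
    max_diff = max(max_diff, float(np.max(np.abs(H @ z - y_t))))
    y_prev2, y_prev = y_prev, y_t
print(max_diff)   # difference between the two recursions, up to rounding
```

Both recursions use the same shocks, so any difference between `H @ z` and `y_t` can only come from floating point rounding.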

An alternative possibility is to define the state vector as

$$z_t := \begin{bmatrix} 1 \\ y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}$$

and choose

$$B := \begin{bmatrix} 1 & 0 & \dots & 0 & 0 \\ \nu & A_1 & \dots & A_{p-1} & A_p \\ 0 & I_K & & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \dots & I_K & 0 \end{bmatrix}
\quad \text{and} \quad
w_t := \begin{bmatrix} 0 \\ u_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

so that

$$z_{t+1} = B z_t + w_t$$

and

$$y_t = [0 : I_K : 0 : \cdots : 0] z_t,$$

which describes the same process as (18.2.8)/(18.2.9). It may be worth pointing
out that in the present framework, the process is assumed to be started at
time $t = 1$ with initial values $z_0 = [1, y_0', \dots, y_{-p+1}']'$, while we have assumed
an infinite past of the process in some previous chapters.


A  VARMA(p,  q)  Process 

One state space representation of the VARMA($p$, $q$) process

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q} \tag{18.2.10}$$

is known from Chapter 11, Section 11.3.2. It is obtained by choosing a state
vector and a transition noise

$$z_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ u_t \\ \vdots \\ u_{t-q+1} \end{bmatrix},
\qquad
w_t := \begin{bmatrix} u_{t+1} \\ 0 \\ \vdots \\ 0 \\ u_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

an input sequence $x_t := 1$ as in (18.2.8), $B := \mathbf{A}$ from Chapter 11, Equation
(11.3.8), $F := \boldsymbol{\nu}$ defined similarly as in (18.2.8), $H := [I_K : 0 : \cdots : 0]$, $G := 0$,
and $v_t := 0$. For many purposes, this form is not the most useful state space
representation of a VARMA model. Other state space representations are
given by Aoki (1987), Hannan & Deistler (1988), and Wei (1990).


The  VARX  Model 

The  VARX  model 

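$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + B_0 x_t + \cdots + B_s x_{t-s} + u_t \tag{18.2.11}$$

considered in Chapter 10 is easily cast in state space form by choosing the
state vector

$$z_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ x_t \\ \vdots \\ x_{t-s+1} \end{bmatrix}$$

and the transition equation

$$z_t = \begin{bmatrix}
A_1 & \dots & A_{p-1} & A_p & B_1 & \dots & B_{s-1} & B_s \\
I & \dots & 0 & 0 & 0 & \dots & 0 & 0 \\
\vdots & \ddots & & \vdots & \vdots & & & \vdots \\
0 & \dots & I & 0 & 0 & \dots & 0 & 0 \\
0 & \dots & 0 & 0 & 0 & \dots & 0 & 0 \\
0 & \dots & 0 & 0 & I & \dots & 0 & 0 \\
\vdots & & & \vdots & \vdots & \ddots & & \vdots \\
0 & \dots & 0 & 0 & 0 & \dots & I & 0
\end{bmatrix} z_{t-1}
+ \begin{bmatrix} B_0 \\ 0 \\ \vdots \\ 0 \\ I \\ 0 \\ \vdots \\ 0 \end{bmatrix} x_t
+ \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{18.2.12}$$

The corresponding observation equation is

$$y_t = [I_K : 0 : \cdots : 0] z_t.$$
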
It  is  also  possible  to  extend  the  model  so  as  to  allow  for  a  finite  order  MA(q) 
error  process  in  (18.2.11)  (see  Problem  18.1). 


Systematic  Sampling  and  Aggregation 

Suppose that annual data is available whereas a decision maker is interested
in, say, quarterly figures. Let $\eta_{it}$ be an $(M \times 1)$ vector of variables associated
with the $i$-th quarter of year $t$ and suppose the vector of all quarterly variables
associated with year $t$,

$$\eta_t := \begin{bmatrix} \eta_{1t} \\ \eta_{2t} \\ \eta_{3t} \\ \eta_{4t} \end{bmatrix},$$

is generated by the VAR($p$) process

$$\eta_t = A_1 \eta_{t-1} + \cdots + A_p \eta_{t-p} + u_t.$$

Then we may define a state vector

$$z_t := \begin{bmatrix} \eta_t \\ \vdots \\ \eta_{t-p+1} \end{bmatrix}$$

and a transition equation

$$z_t = \mathbf{A} z_{t-1} + U_t,$$

where $\mathbf{A}$ and $U_t$ are the same quantities as in (18.2.7). If the yearly values
are obtained by adding (aggregating) the quarterly figures, the observation
equation is

$$y_t = [I_M : I_M : I_M : I_M : 0 : \cdots : 0] z_t,$$

where $M$ is the dimension of $\eta_{it}$. Alternatively, if the annual figures are
obtained by systematic sampling, that is, by taking, say, the fourth quarter
values as the annual figures, the observation equation is

$$y_t = [0 : 0 : 0 : I_M : 0 : \cdots : 0] z_t.$$

Extensions of this framework to the case where $\eta_t$ is generated by a
VARMA or VARX process are straightforward. For applications of state space
models in aggregation and systematic sampling problems see Nijman (1985),
Harvey (1984), Harvey & Pierse (1984), Jones (1980), Ansley & Kohn (1983).
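
The two observation matrices differ only in which blocks carry an identity matrix. The following Python sketch makes this concrete for a small case; the function name `observation_matrices` and the numerical values are hypothetical and serve only as an illustration.

```python
import numpy as np

def observation_matrices(M, p):
    """Return (H_agg, H_samp) for a state stacking p years of four quarters each."""
    n_state = 4 * M * p
    H_agg = np.zeros((M, n_state))
    H_samp = np.zeros((M, n_state))
    for i in range(4):
        H_agg[:, i * M:(i + 1) * M] = np.eye(M)   # add up the four quarters of year t
    H_samp[:, 3 * M:4 * M] = np.eye(M)            # pick out the fourth quarter only
    return H_agg, H_samp

# Hypothetical example: M = 2 variables, p = 1, one year of quarterly values.
quarters = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
z = quarters.reshape(-1)          # state = [eta_1t', eta_2t', eta_3t', eta_4t']'
H_agg, H_samp = observation_matrices(M=2, p=1)
annual_total = H_agg @ z          # aggregation: sum over the quarters
annual_q4 = H_samp @ z            # systematic sampling: fourth quarter
print(annual_total, annual_q4)
```

For $p > 1$ the additional columns are zero, so older years in the state vector do not enter the observation equation.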

The  examples  considered  so  far  have  in  common  that  the  system  matrices 
H,  G,  B,  and  F  are  all  time  invariant  and  the  state  vector  consists  of  at 
least  some  observed  or  observable  variables.  In  contrast,  the  state  vector  is 
unobservable  in  the  next  two  examples  while  the  system  matrices  remain  time 
invariant. 

Structural  Time  Series  Models 

In a structural time series model, the observed time series is viewed as a sum
of unobserved components such as a trend, a seasonal component, and an
irregular component (see, e.g., Kitagawa (1981), Harvey & Todd (1983),
Harvey (1989)). For instance, for a univariate time series $y_1, \dots, y_T$, the structural
model may have the form

$$y_t = \mu_t + \gamma_t + u_t, \tag{18.2.13}$$

where $\mu_t$ is a trend component and $\gamma_t$ is a seasonal component. Harvey & Todd
(1983) assume a local approximation to a linear trend function for which both
the level and the slope are shifting. They postulate a process

$$\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t \quad \text{with} \quad \beta_t = \beta_{t-1} + \xi_t \tag{18.2.14}$$

as the trend generation mechanism. Here $\eta_t$ and $\xi_t$ are assumed to be white
noise processes. This trend model is a mixture of two random walks which are
discussed in Chapter 6, Section 6.1. For the seasonal component, it is assumed
that the sum over the seasonal factors of a full year is approximately zero,

$$\gamma_t = -\sum_{i=1}^{s-1} \gamma_{t-i} + \omega_t, \tag{18.2.15}$$

where $s$ is the number of seasons and $\omega_t$ is white noise. The three white noise
processes $\eta_t$, $\xi_t$, and $\omega_t$ are assumed to be independent.

This model can be set up in state space form by defining the state vector
to be

$$z_t := \begin{bmatrix} \mu_t \\ \beta_t \\ \gamma_t \\ \gamma_{t-1} \\ \vdots \\ \gamma_{t-s+2} \end{bmatrix}$$

and, hence, the transition equation is

$$z_t = \begin{bmatrix}
1 & 1 & 0 & \dots & 0 & 0 \\
0 & 1 & 0 & \dots & 0 & 0 \\
0 & 0 & -1 & \dots & -1 & -1 \\
0 & 0 & 1 & & 0 & 0 \\
\vdots & \vdots & & \ddots & & \vdots \\
0 & 0 & 0 & \dots & 1 & 0
\end{bmatrix} z_{t-1}
+ \begin{bmatrix} \eta_t \\ \xi_t \\ \omega_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{18.2.16}$$

The corresponding measurement equation is

$$y_t = [1, 0, 1, 0, \dots, 0] z_t + u_t. \tag{18.2.17}$$

It may be worth noting that these models can be seen to describe special
integrated ARMA processes. Hence, it is, of course, not surprising that they
can be cast in state space form. Multivariate generalizations of this model are
possible (see Harvey (1987) and Proietti (2002)).
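
To see how the transition matrix in (18.2.16) generates a trend plus a repeating seasonal pattern, the following Python sketch builds the matrix for $s = 4$ and iterates the transition equation with all noise terms set to zero. The helper names `bsm_transition` and `step` as well as the starting values are assumptions chosen for illustration.

```python
def bsm_transition(s):
    """Transition matrix for the state [mu_t, beta_t, gamma_t, ..., gamma_{t-s+2}]'."""
    n = 2 + (s - 1)
    B = [[0.0] * n for _ in range(n)]
    B[0][0] = B[0][1] = 1.0     # mu_t = mu_{t-1} + beta_{t-1} + eta_t
    B[1][1] = 1.0               # beta_t = beta_{t-1} + xi_t
    for j in range(2, n):
        B[2][j] = -1.0          # gamma_t = -(gamma_{t-1} + ... + gamma_{t-s+1}) + omega_t
    for i in range(3, n):
        B[i][i - 1] = 1.0       # shift the older seasonal terms down by one position
    return B

def step(B, z, noise):
    """One transition z_t = B z_{t-1} + noise."""
    return [sum(b * zj for b, zj in zip(row, z)) + e for row, e in zip(B, noise)]

s = 4
B = bsm_transition(s)
H = [1.0, 0.0, 1.0] + [0.0] * (s - 2)   # measurement row of (18.2.17)

# Noise-free run from an assumed start: level 10, slope 0.5, seasonals 3, -1, -2.
z = [10.0, 0.5, 3.0, -1.0, -2.0]
signal = []
for _ in range(8):
    z = step(B, z, [0.0] * len(z))
    signal.append(sum(h * zj for h, zj in zip(H, z)))
print(signal)
```

Without noise the seasonal component repeats exactly every $s$ periods and sums to zero over a full year, so observations that are $s$ periods apart differ only by the accumulated slope.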


Factor  Analytic  Models 

In a classical factor analytic setting, it is assumed that a set of $K$ observed
variables $y_t$ depends linearly on $N < K$ unobserved common factors $f_t$ and
on individual or idiosyncratic components $u_t$. In other words,

$$y_t = L f_t + u_t, \tag{18.2.18}$$

where $L$ is a $(K \times N)$ matrix of factor loadings and the components of $u_t$
are typically assumed to be uncorrelated, that is, $\Sigma_u$ is a diagonal matrix
(Anderson (1984), Morrison (1976)). One objective of a factor analysis is the
construction or estimation of the unobserved factors $f_t$. We may view (18.2.18)
as the measurement equation of a state space model and, if the factors $f_t$ and
$f_s$ are independent for $t \neq s$, we may specify a trivial transition equation
$f_t = w_{t-1}$.

However, if $y_t$ consists of time series variables, it may be more reasonable
to assume that the factors are autocorrelated. For example, they may be
generated by a VAR or VARMA process. Also the idiosyncratic components
$u_t$ may be autocorrelated. Dynamic factor analytic models of this type were
considered, for instance, by Sargent & Sims (1977), Geweke (1977), and Engle
& Watson (1981). Assuming that

$$f_t = A_1 f_{t-1} + \cdots + A_p f_{t-p} + \eta_t$$

and

$$u_t = C_1 u_{t-1} + \cdots + C_q u_{t-q} + \varepsilon_t,$$

where $\eta_t$ and $\varepsilon_t$ are white noise processes, a state space model can be set up
by specifying a state vector,

$$z_t := \begin{bmatrix} f_t \\ \vdots \\ f_{t-p+1} \\ u_t \\ \vdots \\ u_{t-q+1} \end{bmatrix},$$

and a transition equation

$$z_t = \begin{bmatrix}
A_1 & \dots & A_{p-1} & A_p & & & & \\
I & & 0 & 0 & & & & \\
& \ddots & & \vdots & & 0 & & \\
0 & \dots & I & 0 & & & & \\
& & & & C_1 & \dots & C_{q-1} & C_q \\
& & & & I & & 0 & 0 \\
& 0 & & & & \ddots & & \vdots \\
& & & & 0 & \dots & I & 0
\end{bmatrix} z_{t-1}
+ \begin{bmatrix} \eta_t \\ 0 \\ \vdots \\ 0 \\ \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{18.2.19}$$

The corresponding measurement equation is

$$y_t = L f_t + u_t = [L : 0 : \cdots : 0 : I_K : 0 : \cdots : 0] z_t. \tag{18.2.20}$$


An extension to the case where $f_t$ and $u_t$ are generated by VARMA
processes is left to the reader (see Problem 18.2). If exogenous variables are added
to the original model (18.2.18) and, in addition, the factors are dynamic
processes, we obtain the dynamic MIMIC models of Engle & Watson (1981). More
recent references on dynamic factor models include Stock & Watson (2002a,
b) and Forni, Hallin, Lippi & Reichlin (2000).

In all the previous examples the system matrices $H_t$, $G_t$, $B_t$, and $F_t$ are
time invariant. We will now consider models where at least some elements of
these matrices vary through time.


VARX  Models  with  Systematically  Varying  Coefficients 

We extend the varying coefficients VAR models of Chapter 17 slightly by
adding further “exogenous” variables and assuming that a given multiple time
series is generated according to

$$y_t = A_{1,t} y_{t-1} + \cdots + A_{p,t} y_{t-p} + F_t x_t + u_t. \tag{18.2.21}$$

The vector $x_t$ may simply include an intercept term or seasonal dummies. It
may also include other deterministic terms and even lags of exogenous
variables. Because we are assuming that the input variables of the state space
model are nonstochastic, we restrict $x_t$ to be a deterministic sequence,
however. Using

$$Y_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \end{bmatrix}, \quad
\mathbf{A}_t := \begin{bmatrix} A_{1,t} & \dots & A_{p-1,t} & A_{p,t} \\ I & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I & 0 \end{bmatrix}, \quad
\mathbf{F}_t := \begin{bmatrix} F_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \text{and} \quad
w_{t-1} := \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

gives a transition equation

$$Y_t = \mathbf{A}_t Y_{t-1} + \mathbf{F}_t x_t + w_{t-1} \tag{18.2.22}$$

and a measurement equation

$$y_t = [I_K : 0 : \cdots : 0] Y_t. \tag{18.2.23}$$

Obviously, the transition matrix $B_{t-1} := \mathbf{A}_t$ and the input matrix $\mathbf{F}_t$ of the
transition equation may be time varying in this state space model.

Random Coefficients VARX Models

So far all the original models either have time invariant, constant coefficients
or, as in the previous example, systematically varying coefficients. We will
now consider models with random coefficients and demonstrate how they can
be cast in state space form. Let us begin with a simple multivariate regression
model of the form

$$y_t = C_t x_t + v_t = (x_t' \otimes I)\,\text{vec}(C_t) + v_t. \tag{18.2.24}$$

Assuming that the parameter vector $\gamma_t := \text{vec}(C_t)$ is generated by a VAR($q$)
process,

$$\gamma_t = \nu + B_1 \gamma_{t-1} + \cdots + B_q \gamma_{t-q} + u_t, \tag{18.2.25}$$

we may define the state vector as

$$z_t := \begin{bmatrix} \gamma_t \\ \vdots \\ \gamma_{t-q+1} \end{bmatrix}$$

and get a state space model with the following transition and measurement
equations, respectively:

$$z_t = \begin{bmatrix} B_1 & \dots & B_{q-1} & B_q \\ I & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \dots & I & 0 \end{bmatrix} z_{t-1}
+ \begin{bmatrix} \nu \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

$$y_t = [x_t' \otimes I : 0 : \cdots : 0] z_t + v_t.$$

Obviously, the measurement matrix,

$$H_t := [x_t' \otimes I : 0 : \cdots : 0],$$


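may be time varying. It may, in fact, be random if the $x_t$ are stochastic
variables. Such an assumption is mandatory if $x_t$ contains lagged $y_t$ variables. To
see this point more clearly, let us explicitly introduce lagged $y_t$'s in (18.2.24):

$$\begin{aligned}
y_t &= A_t Y_{t-1} + C_t x_t + v_t \\
&= (Y_{t-1}' \otimes I)\,\text{vec}(A_t) + (x_t' \otimes I)\,\text{vec}(C_t) + v_t,
\end{aligned} \tag{18.2.26}$$

where $A_t := [A_{1,t}, \dots, A_{p,t}]$ and $Y_{t-1}' := [y_{t-1}', \dots, y_{t-p}']$. Now suppose that
$\gamma_t := \text{vec}(C_t)$ is generated by the VAR($q$) process (18.2.25) and $\alpha_t := \text{vec}(A_t)$
is driven by a VAR($r$) process

$$\alpha_t = D_1 \alpha_{t-1} + \cdots + D_r \alpha_{t-r} + \eta_t, \tag{18.2.27}$$

which is assumed to be independent of $\gamma_t$. Defining the state vector as

$$z_t := \begin{bmatrix} \gamma_t \\ \vdots \\ \gamma_{t-q+1} \\ \alpha_t \\ \vdots \\ \alpha_{t-r+1} \end{bmatrix},$$

the following state space model is obtained:

$$z_t = \begin{bmatrix}
B_1 & \dots & B_{q-1} & B_q & & & & \\
I & & 0 & 0 & & & & \\
& \ddots & & \vdots & & 0 & & \\
0 & \dots & I & 0 & & & & \\
& & & & D_1 & \dots & D_{r-1} & D_r \\
& & & & I & & 0 & 0 \\
& 0 & & & & \ddots & & \vdots \\
& & & & 0 & \dots & I & 0
\end{bmatrix} z_{t-1}
+ \begin{bmatrix} \nu \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \\ \eta_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{18.2.28}$$

$$y_t = [x_t' \otimes I : 0 : \cdots : 0 : Y_{t-1}' \otimes I : 0 : \cdots : 0] z_t + v_t. \tag{18.2.29}$$

Further extensions of this model are possible. For instance, $\alpha_t$ and $\gamma_t$
may be individually or jointly generated by a VARMA rather than a finite
order VAR process. Moreover, input variables with constant coefficients could
appear in (18.2.26). These extensions are left to the reader (see Problem 18.3).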
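
The measurement equations (18.2.24) and (18.2.29) rest on the identity $C_t x_t = (x_t' \otimes I)\,\text{vec}(C_t)$, which moves the random coefficients into the state vector and the regressors into the measurement matrix. A short numerical check in Python, with dimensions and random values that are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 3, 2                                   # assumed dimensions of y_t and x_t
C = rng.standard_normal((K, M))               # hypothetical coefficient matrix C_t
x = rng.standard_normal(M)                    # hypothetical regressor vector x_t

vec_C = C.reshape(-1, order="F")              # vec stacks the columns of C_t
H_t = np.kron(x.reshape(1, -1), np.eye(K))    # (x_t' kron I_K), shape (K, K*M)

lhs = C @ x                                   # C_t x_t
rhs = H_t @ vec_C                             # (x_t' kron I) vec(C_t)
print(np.allclose(lhs, rhs))
```

Note that `order="F"` is needed because NumPy flattens row by row by default, whereas the vec operator stacks columns.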

The  number  of  publications  on  random  coefficients  models  is  vast  in  both 
the  econometrics  and  the  time  series  literature.  Famous  examples  from  the 
earlier  econometrics  literature  on  the  topic  are  Hildreth  &  Houck  (1968), 
Swamy  (1971),  and  Cooley  &  Prescott  (1973,  1976).  Surveys  of  the  earlier 
literature  were  given  by  Chow  (1984)  and  Nicholls  &  Pagan  (1985).  Both  of 
these  articles  include  extensive  reference  lists.  For  a  more  recent  overview  see 
also  Swamy  &  Tavlas  (2001).  On  the  time  series  side,  a  number  of  references 
can  be  found  in  the  monograph  by  Nicholls  &  Quinn  (1982).  Other  important 
work  on  the  topic  includes  the  article  by  Doan  et  al.  (1984)  who  investigated 
the  potential  of  random  coefficients  VAR  models  with  Bayesian  restrictions 
for econometric time series analysis.

18.2.2 More General State Space Models

There are also time series models that do not fall into the state space
framework considered so far. Therefore it may be worth pointing out that more
general nonlinear state space models have been studied in recent publications.
A very general setup has the form

$$z_{t+1} = b_t(z_t, x_t, w_t, \delta_1) \tag{18.2.30}$$

for the transition equation and

$$y_t = h_t(z_t, x_t, v_t, \delta_2) \tag{18.2.31}$$

for the measurement equation. In other words, the functional dependence
between the inputs, the states, and the output variables may be of a general
nonlinear form and also the transition from one state to the next is described
by a more general function than previously. Here $\delta_1$ and $\delta_2$ are vectors of
parameters.

Bilinear time series models are examples for which the linear state space
framework is too narrow. A very simple univariate bilinear time series model
has the form

$$y_t = \alpha y_{t-1} + u_t + \beta y_{t-1} u_{t-1},$$

where $u_t$ is univariate white noise. The product term $\beta y_{t-1} u_{t-1}$ distinguishes
this model from a linear specification. Bilinear models have been found useful
in modelling nonnormal phenomena (see, e.g., Granger & Andersen (1978)).
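
The effect of the product term can be explored by simulation. The following Python sketch generates the simple univariate bilinear process for a given shock sequence and compares it with the linear AR(1) case $\beta = 0$ driven by the same shocks. The function name `simulate_bilinear`, the parameter values, and the Gaussian shocks are assumptions made for illustration.

```python
import random

def simulate_bilinear(alpha, beta, u, y0=0.0):
    """Generate y_t = alpha*y_{t-1} + u_t + beta*y_{t-1}*u_{t-1} for t = 1, ..., T.

    u holds the shocks u_0, u_1, ..., u_T; the returned list starts with y_0 = y0.
    """
    y = [y0]
    for t in range(1, len(u)):
        y.append(alpha * y[t - 1] + u[t] + beta * y[t - 1] * u[t - 1])
    return y

random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(201)]
linear = simulate_bilinear(0.5, 0.0, shocks)      # beta = 0: plain AR(1)
bilinear = simulate_bilinear(0.5, 0.3, shocks)    # same shocks, product term switched on
print(max(abs(a - b) for a, b in zip(linear, bilinear)))
```

Because both series share the same shocks, the printed maximum gap is entirely due to the term $\beta y_{t-1} u_{t-1}$.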

A more general multivariate bilinear time series model may be specified
as follows:

$$\begin{aligned}
y_t = {} & A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t + M_1 u_{t-1} + \cdots + M_q u_{t-q} \\
& + \sum_{i=1}^{r} \sum_{j=1}^{s} C_{ij}\,\text{vec}(y_{t-i} u_{t-j}').
\end{aligned} \tag{18.2.32}$$

Assuming, without loss of generality, that $p \geq r$ and $q \geq s$ and defining

$$z_t := \begin{bmatrix} y_t \\ \vdots \\ y_{t-p+1} \\ u_t \\ \vdots \\ u_{t-q+1} \end{bmatrix}$$

and a matrix $C$ which contains the elements of the $C_{ij}$ matrices in a suitable
arrangement, we get a bilinear state space model of the form

$$z_{t+1} = B z_t + w_t + C\,\text{vec}(z_t z_t'), \tag{18.2.33}$$

$$y_t = [I_K : 0 : \cdots : 0] z_t. \tag{18.2.34}$$

Obviously, the transition equation involves a nonlinear term, namely $\text{vec}(z_t z_t')$.
Hence, (18.2.33)/(18.2.34) is an example of a nonlinear state space system.

The work of Granger & Andersen (1978) and others on univariate bilinear
models has stimulated investigations in this area. Much of the earlier work is
documented in a monograph by Subba Rao & Gabr (1984). More recent work
on multivariate bilinear models includes Stensholt & Tjøstheim (1987) and
Liu (1989).

With all these examples we have not nearly exhausted the range of models
that have been used and studied in the recent time series literature. Important
omissions are threshold autoregressive models analyzed by Tong (1983) and
exponential autoregressive models introduced by Ozaki (1980) and Haggan
& Ozaki (1980). A general nonlinear model class was considered by Priestley
(1980) and reviews of many nonlinear models and extensive lists of
references were given by Priestley (1988), Anděl (1989), and Granger & Teräsvirta
(1993).


18.3  The  Kalman  Filter 

The Kalman filter was originally developed by Kalman (1960) and Kalman
& Bucy (1961). It is a tool to recursively estimate the states $z_t$, given
observations $y_1, \dots, y_t$ of the output variables. Under normality assumptions, the
estimator of the state produced by the Kalman filter is the conditional
expectation $E(z_t|y_1, \dots, y_t)$. The Kalman filter also provides the conditional
covariance matrix $\text{Cov}(z_t|y_1, \dots, y_t)$ which may serve as a measure for estimation
or prediction uncertainty. Of course, for $t > T$, the estimator $E(z_t|y_1, \dots, y_T)$
is a forecast or prediction at origin $T$, in the terminology of the previous
chapters. The computation of the estimators $E(z_t|y_1, \dots, y_t)$, $t = 1, \dots, T$, is
called filtering to distinguish it from the forecasting problem.

In some of the examples of Section 18.2, estimation of the state vectors is
of obvious interest, for instance, if the state vector consists of time varying
coefficients or if the state vector contains the unobserved factors of a dynamic
factor analytic model. In other cases, where the state vector is not of foremost
interest or where it consists of observable variables, the conditional means and
covariance matrices can still be useful in evaluating the likelihood function,
for example. We will return to this point in Section 18.4. Now the Kalman
filter recursions will be presented.

18.3.1  The  Kalman  Filter  Recursions 
Assumptions  for  the  State  Space  Model 

We  assume  a  state  space  model  with  transition  equation 

$$z_t = B z_{t-1} + F x_{t-1} + w_{t-1} \tag{18.3.1}$$

and with measurement equation

$$y_t = H_t z_t + G x_t + v_t \tag{18.3.2}$$

for $t = 1, 2, \dots$. Note that both input matrices and the transition matrix are
assumed to be time invariant and known. This condition is satisfied in most
of the example models of Section 18.2. The measurement matrices $H_t$ are
assumed to be known and nonstochastic at time $t$. This assumption does not
exclude lagged output variables from $H_t$ because the past output variables are
given at time $t$. The input sequence $x_t$, $t = 0, 1, \dots$, is again assumed to be
nonstochastic for simplicity. The noise processes $w_t$ and $v_t$ are independent.
They are both Gaussian with time invariant covariances,

$$w_t \sim \mathcal{N}(0, \Sigma_w), \quad t = 0, 1, \dots,$$

$$v_t \sim \mathcal{N}(0, \Sigma_v), \quad t = 1, 2, \dots.$$

Also the initial state is Gaussian, $z_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$, and it is assumed to be
independent of $v_t$, $w_{t-1}$, $t = 1, \dots$. The initial state may be a constant,
nonstochastic vector in which case $\Sigma_0 = 0$.

With  the  exception  of  the  normality  assumption,  the  foregoing  conditions 
are  satisfied  for  most  of  the  example  models  of  Section  18.2  under  the  usual 
assumptions  entertained  for  these  models.  It  is  possible  to  derive  recursions 
similar  to  those  given  below  under  more  general  conditions.  If  the  normality 
assumption  is  dropped,  the  recursions  given  below  can  still  be  justified.  We 
will return to this issue after having presented them.

The Recursions

We will use the following additional notation in stating the Kalman filter
recursions:

$$\begin{aligned}
z_{t|s} &:= E(z_t|y_1, \dots, y_s), \\
\Sigma_z(t|s) &:= \text{Cov}(z_t|y_1, \dots, y_s), \\
y_{t|s} &:= E(y_t|y_1, \dots, y_s), \\
\Sigma_y(t|s) &:= \text{Cov}(y_t|y_1, \dots, y_s),
\end{aligned} \tag{18.3.3}$$

$(z|y) \sim \mathcal{N}(\mu, \Sigma)$ means that the conditional distribution of $z$ given $y$ is
multivariate normal with mean $\mu$ and covariance matrix $\Sigma$.

Under the previously stated conditions, the normality assumption implies

$$(z_t|y_1, \dots, y_{t-1}) \sim \mathcal{N}(z_{t|t-1}, \Sigma_z(t|t-1)) \quad \text{for } t = 2, \dots, T, \tag{18.3.4}$$

$$(z_t|y_1, \dots, y_t) \sim \mathcal{N}(z_{t|t}, \Sigma_z(t|t)) \quad \text{for } t = 1, \dots, T, \tag{18.3.5}$$

$$(y_t|y_1, \dots, y_{t-1}) \sim \mathcal{N}(y_{t|t-1}, \Sigma_y(t|t-1)) \quad \text{for } t = 2, \dots, T, \tag{18.3.6}$$

$$(z_t|y_1, \dots, y_T) \sim \mathcal{N}(z_{t|T}, \Sigma_z(t|T)), \tag{18.3.7}$$

and

$$(y_t|y_1, \dots, y_T) \sim \mathcal{N}(y_{t|T}, \Sigma_y(t|T)) \quad \text{for } t > T. \tag{18.3.8}$$

The conditional means and covariance matrices can be obtained by the
following Kalman filter recursions which are graphically depicted in Figure 18.1:

Initialization: $z_{0|0} := \mu_0$, $\Sigma_z(0|0) := \Sigma_0$.

Prediction step ($1 \leq t \leq T$):

$$\begin{aligned}
z_{t|t-1} &= B z_{t-1|t-1} + F x_{t-1}, \\
\Sigma_z(t|t-1) &= B \Sigma_z(t-1|t-1) B' + \Sigma_w, \\
y_{t|t-1} &= H_t z_{t|t-1} + G x_t, \\
\Sigma_y(t|t-1) &= H_t \Sigma_z(t|t-1) H_t' + \Sigma_v.
\end{aligned}$$

Correction step ($1 \leq t \leq T$):

$$\begin{aligned}
z_{t|t} &= z_{t|t-1} + P_t (y_t - y_{t|t-1}), \\
\Sigma_z(t|t) &= \Sigma_z(t|t-1) - P_t \Sigma_y(t|t-1) P_t',
\end{aligned}$$

where

$$P_t := \Sigma_z(t|t-1) H_t' \Sigma_y(t|t-1)^{-1} \quad \text{(Kalman filter gain)}.$$

Although the output variables we have in mind have nonsingular distributions,
it may be worth noting that if the inverse of $\Sigma_y(t|t-1)$ does not exist, it
may be replaced by a suitable generalized inverse. The recursions proceed by
performing the prediction step for $t = 1$. Then the correction step is carried
out for $t = 1$. Then the prediction and correction steps are repeated for $t = 2$,
and so on.

Forecasting step ($t > T$):

$$\begin{aligned}
z_{t|T} &= B z_{t-1|T} + F x_{t-1}, \\
\Sigma_z(t|T) &= B \Sigma_z(t-1|T) B' + \Sigma_w, \\
y_{t|T} &= H_t z_{t|T} + G x_t, \\
\Sigma_y(t|T) &= H_t \Sigma_z(t|T) H_t' + \Sigma_v.
\end{aligned}$$

The forecasting step may be carried out recursively for $t = T+1, T+2, \dots$.
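
The prediction and correction steps translate almost line by line into code. The following Python sketch, using NumPy, is one possible implementation under the assumptions of this section; the function name `kalman_filter`, its argument layout, and the use of the Moore-Penrose inverse as the generalized inverse are choices made for this illustration and are not prescribed by the text.

```python
import numpy as np

def kalman_filter(y, H, B, Sigma_w, Sigma_v, mu0, Sigma0, F=None, G=None, x=None):
    """Prediction and correction steps for t = 1, ..., T.

    y has shape (T, K); H has shape (T, K, N), so H[t-1] plays the role of H_t.
    x, if given, has shape (T + 1, M) and holds x_0, ..., x_T.
    Returns the filtered means z_{t|t} and covariances Sigma_z(t|t).
    """
    T = y.shape[0]
    z = np.asarray(mu0, dtype=float)          # z_{0|0}
    S = np.asarray(Sigma0, dtype=float)       # Sigma_z(0|0)
    z_filt, S_filt = [], []
    for t in range(1, T + 1):
        Ht = H[t - 1]
        # Prediction step
        z_pred = B @ z + (F @ x[t - 1] if F is not None else 0.0)
        S_pred = B @ S @ B.T + Sigma_w
        y_pred = Ht @ z_pred + (G @ x[t] if G is not None else 0.0)
        Sy_pred = Ht @ S_pred @ Ht.T + Sigma_v
        # Correction step; pinv also covers a singular Sigma_y(t|t-1)
        P = S_pred @ Ht.T @ np.linalg.pinv(Sy_pred)     # Kalman filter gain P_t
        z = z_pred + P @ (y[t - 1] - y_pred)
        S = S_pred - P @ Sy_pred @ P.T
        z_filt.append(z)
        S_filt.append(S)
    return np.array(z_filt), np.array(S_filt)

# Scalar check (N = K = 1, T = 1); all values are assumed for illustration.
B = np.array([[1.0]])
Sw = np.array([[1.0]])
Sv = np.array([[1.0]])
H = np.ones((1, 1, 1))
z_filt, S_filt = kalman_filter(np.array([[2.0]]), H, B, Sw, Sv, [0.0], [[1.0]])
print(z_filt[0], S_filt[0])   # by hand: z_{1|1} = 4/3 and Sigma_z(1|1) = 2/3
```

In the scalar check, $\Sigma_z(1|0) = 2$ and $\Sigma_y(1|0) = 3$, so the gain is $P_1 = 2/3$, which gives the values stated in the comment. A production implementation would typically use a numerically more robust form of the covariance update.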

Computational  Aspects  and  Extensions 

In practice, in running through the Kalman filter recursions, computational
inaccuracies may accumulate in such a way that the actually computed
covariance matrices are not positive semidefinite. These and other computational
issues were discussed in Anderson & Moore (1979, Chapter 6) and numerical
modifications of the recursions were suggested that may help to overcome the
possible difficulties (see also Schneider (1992)).

As mentioned previously, it is possible to justify the Kalman filter
recursions even if the initial state and the white noise processes are not Gaussian.
In that case, the quantities obtained by the recursions are no longer moments
of conditional normal distributions, however. For other interpretations of the
quantities see, for example, Schneider (1988).

Sometimes reconstruction of the state vectors, given all the information
$y_1, \dots, y_T$, is of interest. For instance, in the random coefficients models of
Section 18.2.1, where the state vector $z_t$ contains the coefficients associated
with period $t$, one may want to estimate the states and, hence, the coefficients,
given all the sample information $y_1, \dots, y_T$. We will see a detailed example in
Section 18.5. Recursions are also available to compute $z_{t|T}$ and $\Sigma_z(t|T)$ for
$t < T$. The evaluation of $z_{t|T}$ for $t < T$ is known as smoothing. Under the
previous assumptions (including normality),

$$(z_t|y_1, \dots, y_T) \sim \mathcal{N}(z_{t|T}, \Sigma_z(t|T))$$

for $t = 0, 1, \dots, T$. The conditional moments may be obtained recursively,
starting at the end of the sample and moving backwards, that is, the recursions
proceed for $t = T-1, T-2, \dots, 0$ as follows.

Fig. 18.1. Kalman filter recursions.

Smoothing step ($t < T$):

$$\begin{aligned}
z_{t|T} &= z_{t|t} + S_t (z_{t+1|T} - z_{t+1|t}), \\
\Sigma_z(t|T) &= \Sigma_z(t|t) - S_t [\Sigma_z(t+1|t) - \Sigma_z(t+1|T)] S_t',
\end{aligned}$$

where

$$S_t := \Sigma_z(t|t) B' \Sigma_z(t+1|t)^{-1} \quad \text{(Kalman smoothing matrix)}$$

(see Anderson & Moore (1979)).
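
The backward recursions are easiest to follow in the scalar case. The following Python sketch runs the filter and then the smoothing step for an assumed scalar model $z_t = b z_{t-1} + w_{t-1}$, $y_t = z_t + v_t$ (so $H_t = 1$, $F = G = 0$). The function name `filter_and_smooth` and the data are hypothetical.

```python
def filter_and_smooth(y, b, sw, sv, mu0, s0):
    """Scalar Kalman filter followed by the backward smoothing recursions.

    Assumed model: z_t = b*z_{t-1} + w_{t-1}, y_t = z_t + v_t,
    with Var(w) = sw, Var(v) = sv and z_0 ~ N(mu0, s0).
    Returns smoothed and filtered means and variances for t = 0, ..., T.
    """
    T = len(y)
    z_f, s_f = [mu0], [s0]        # z_{t|t} and Sigma_z(t|t), t = 0, ..., T
    z_p, s_p = [None], [None]     # z_{t|t-1} and Sigma_z(t|t-1), t = 1, ..., T
    for t in range(1, T + 1):
        zp = b * z_f[-1]                          # prediction step
        sp = b * s_f[-1] * b + sw
        gain = sp / (sp + sv)                     # P_t with H_t = 1
        z_p.append(zp)
        s_p.append(sp)
        z_f.append(zp + gain * (y[t - 1] - zp))   # correction step
        s_f.append(sp - gain * (sp + sv) * gain)
    z_s, s_s = z_f[:], s_f[:]     # at t = T smoothed and filtered values coincide
    for t in range(T - 1, -1, -1):
        S_t = s_f[t] * b / s_p[t + 1]             # Kalman smoothing matrix S_t
        z_s[t] = z_f[t] + S_t * (z_s[t + 1] - z_p[t + 1])
        s_s[t] = s_f[t] - S_t * (s_p[t + 1] - s_s[t + 1]) * S_t
    return z_s, s_s, z_f, s_f

# Hypothetical data for a random walk state (b = 1) observed with noise.
y_obs = [1.2, 0.8, 1.5, 1.1]
z_s, s_s, z_f, s_f = filter_and_smooth(y_obs, 1.0, 1.0, 1.0, 0.0, 1.0)
print(s_f)
print(s_s)
```

Since smoothing conditions on the full sample, each smoothed variance is no larger than the corresponding filtered variance, and the two coincide at $t = T$.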


18.3.2  Proof  of  the  Kalman  Filter  Recursions 


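The proof follows Anderson & Moore (1979, pp. 39-41) and Meinhold &
Singpurwalla (1983). It may be skipped without loss of continuity. We
proceed inductively and we use the following properties of multivariate normal
distributions (see Propositions B.1 and B.2 of Appendix B):

$y \sim \mathcal{N}(\mu_y, \Sigma_y)$, $z \sim \mathcal{N}(\mu_z, \Sigma_z)$ are independent $(K \times 1)$ random vectors

$$\Rightarrow \quad y + z \sim \mathcal{N}(\mu_y + \mu_z, \Sigma_y + \Sigma_z). \tag{18.3.9}$$

If $A$ is a fixed, nonrandom matrix and $c$ a fixed vector,

$$y \sim \mathcal{N}(\mu_y, \Sigma_y) \quad \Rightarrow \quad Ay + c \sim \mathcal{N}(A\mu_y + c, A \Sigma_y A'). \tag{18.3.10}$$

Moreover,

$$\begin{bmatrix} z \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_z \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_z & \Sigma_{zy} \\ \Sigma_{yz} & \Sigma_y \end{bmatrix} \right)
\quad \Rightarrow \quad
(z|y) \sim \mathcal{N}\left( \mu_z + \Sigma_{zy} \Sigma_y^{-1} (y - \mu_y),\; \Sigma_z - \Sigma_{zy} \Sigma_y^{-1} \Sigma_{yz} \right). \tag{18.3.11}$$

Here $\Sigma_y^{-1}$ may be replaced by a generalized inverse, if $\Sigma_y$ is singular.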
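
Property (18.3.11) is the key step behind the correction step, so a small numerical version may be helpful. The following Python sketch evaluates the conditional mean and covariance; the function name `conditional_normal` and the test values are assumptions, and the Moore-Penrose inverse is used as one possible generalized inverse.

```python
import numpy as np

def conditional_normal(mu_z, mu_y, S_z, S_zy, S_y, y):
    """Mean and covariance of (z | y) for jointly normal z and y, as in (18.3.11)."""
    S_y_inv = np.linalg.pinv(S_y)           # generalized inverse if S_y is singular
    mean = mu_z + S_zy @ S_y_inv @ (y - mu_y)
    cov = S_z - S_zy @ S_y_inv @ S_zy.T     # uses Sigma_yz = Sigma_zy'
    return mean, cov

# Scalar example with assumed values: Var(z) = 1, Cov(z, y) = 1, Var(y) = 3, y = 2.
mean, cov = conditional_normal(
    np.array([0.0]), np.array([0.0]),
    np.array([[1.0]]), np.array([[1.0]]), np.array([[3.0]]),
    np.array([2.0]),
)
print(mean, cov)   # by hand: 2/3 and 2/3
```

With $\mu_z = z_{t|t-1}$, $\mu_y = y_{t|t-1}$, $\Sigma_{zy} = \Sigma_z(t|t-1) H_t'$ and $\Sigma_y = \Sigma_y(t|t-1)$, the two returned quantities are exactly the correction step of Section 18.3.1.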

We will now demonstrate the prediction and correction steps for $t = 1$.
With that goal in mind, we note that by (18.3.9) and (18.3.10) and the joint
normality of $w_0$ and $v_1$, the two vectors $z_1$ and $y_1$ are jointly normally
distributed. Hence,

$$\begin{aligned}
z_{1|0} &:= E(z_1) = B\mu_0 + F x_0 = B z_{0|0} + F x_0, \\
\Sigma_z(1|0) &:= \text{Cov}(z_1) = \Sigma_w + B \Sigma_0 B' = B \Sigma_z(0|0) B' + \Sigma_w, \\
y_{1|0} &:= E(y_1) = H_1 z_{1|0} + G x_1, \\
\Sigma_y(1|0) &:= \text{Cov}(y_1) = H_1 \Sigma_w H_1' + \Sigma_v + H_1 B \Sigma_0 B' H_1' \\
&\phantom{:}= H_1 \Sigma_z(1|0) H_1' + \Sigma_v,
\end{aligned}$$

which proves the prediction step for $t = 1$. Using these results and (18.3.11),
the conditional distribution of $z_1$ given $y_1$ is seen to be

$$(z_1|y_1) \sim \mathcal{N}\big( z_{1|0} + \Sigma_z(1|0) H_1' \Sigma_y(1|0)^{-1} (y_1 - y_{1|0}),\;
\Sigma_z(1|0) - \Sigma_z(1|0) H_1' \Sigma_y(1|0)^{-1} H_1 \Sigma_z(1|0) \big),$$

which proves the correction step for $t = 1$.

Now the prediction and correction steps can be shown by induction.
Suppose the normal distributions in (18.3.4)-(18.3.6) and the prediction and
correction steps are correct for $t - 1$. Then, using the transition and measurement
equations, $z_t$ and $y_t$ have a joint normal distribution

$$\begin{bmatrix} z_t \\ y_t \end{bmatrix}
= \begin{bmatrix} I \\ H_t \end{bmatrix} z_t + \begin{bmatrix} 0 \\ G \end{bmatrix} x_t + \begin{bmatrix} 0 \\ v_t \end{bmatrix}
= \begin{bmatrix} I \\ H_t \end{bmatrix} (B z_{t-1} + F x_{t-1} + w_{t-1}) + \begin{bmatrix} 0 \\ G \end{bmatrix} x_t + \begin{bmatrix} 0 \\ v_t \end{bmatrix}.$$

By the induction assumption and (18.3.9)/(18.3.10), this term has the
following conditional normal distribution, given $y_1, \dots, y_{t-1}$:

$$\mathcal{N}\left(
\begin{bmatrix} B z_{t-1|t-1} + F x_{t-1} \\ H_t (B z_{t-1|t-1} + F x_{t-1}) + G x_t \end{bmatrix},
\begin{bmatrix} \Sigma_z(t|t-1) & \Sigma_z(t|t-1) H_t' \\ H_t \Sigma_z(t|t-1) & H_t \Sigma_z(t|t-1) H_t' + \Sigma_v \end{bmatrix}
\right), \tag{18.3.12}$$

where $\Sigma_z(t|t-1) = B \Sigma_z(t-1|t-1) B' + \Sigma_w$. This proves the prediction step.
Application of (18.3.11) to (18.3.12) gives the conditional distribution of $z_t$
given $y_1, \dots, y_t$ and proves the correction step.

It remains to prove the forecasting step. Again by induction $(z_t|y_1, \dots, y_T)$
and $(y_t|y_1, \dots, y_T)$ both have normal distributions with the first and second
moments as stated in the forecasting step.


18.4  Maximum  Likelihood  Estimation  of  State  Space 
Models 

In this section, we consider ML estimation of the state space system given in Section 18.3.1. We assume that the matrices $B$, $F$, $H_t$, $G$, $\Sigma_w$, $\Sigma_v$, $\Sigma_0$, and the vector $\mu_0$ depend on a vector of time invariant parameters $\delta$. In other words, $\delta$ is time invariant, even if $H_t$ is not. For a given $\delta$, the matrices are assumed to be uniquely determined and at least twice continuously differentiable with respect to the elements of $\delta$. For instance, in the state space model (18.2.8)/(18.2.9) which represents the finite order VAR process (18.2.6),

$$
\delta = \begin{bmatrix} \mathrm{vec}[\nu, A_1, \dots, A_p] \\ \mathrm{vech}(\Sigma_u) \end{bmatrix}
$$

if no constraints are placed on the VAR coefficients or $\Sigma_u$ and if the initial conditions $y_{-p+1}, \dots, y_0$ are assumed to be known and fixed. The objective in this section is to estimate $\delta$. We will set up the log-likelihood function first. Then we discuss its maximization and, finally, the asymptotic properties of the ML estimators are considered.


18.4.1  The  Log-Likelihood  Function 

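By Bayes' theorem, the sample density function can be written as

$$
\begin{aligned}
f(y_1, \dots, y_T; \delta) &= f(y_1; \delta) f(y_2, \dots, y_T|y_1; \delta) \\
&= f(y_1; \delta) f(y_2|y_1; \delta) \cdots f(y_T|y_1, \dots, y_{T-1}; \delta).
\end{aligned}
$$

Thus, using the notation of the previous section and assuming that $y_t$ has dimension $K$, the Gaussian log-likelihood for the present case is

$$
\begin{aligned}
\ln l(\delta|y_1, \dots, y_T) &= \ln f(y_1, \dots, y_T; \delta) \\
&= \ln f(y_1; \delta) + \sum_{t=2}^{T} \ln f(y_t|y_1, \dots, y_{t-1}; \delta) \\
&= -\frac{KT}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T} \ln|\Sigma_y(t|t-1)| \\
&\quad - \frac{1}{2}\sum_{t=1}^{T} (y_t - y_{t|t-1})'\Sigma_y(t|t-1)^{-1}(y_t - y_{t|t-1}),
\end{aligned}
\tag{18.4.1}
$$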

where we have used that $y_{1|0} := E(y_1)$, $\Sigma_y(1|0) := \mathrm{Cov}(y_1)$, and

$$
(y_t|y_1, \dots, y_{t-1}) \sim \mathcal{N}(y_{t|t-1}, \Sigma_y(t|t-1)), \quad t = 2, \dots, T,
$$

from Section 18.3.1. Here both $y_{t|t-1}$ and $\Sigma_y(t|t-1)$ depend in general on the parameter vector $\delta$. If a specific vector $\delta$ is given, all the quantities in the log-likelihood function can be computed with the Kalman filter recursions. Thus, the Kalman filter is seen to be a useful tool for evaluating the log-likelihood function of a wide range of models. Note also that we have considered likelihood approximations for VARMA processes in Chapter 12. In the present framework, the exact likelihood may be obtained (see also Solo (1984)).

To simplify the expression for the log-likelihood given in (18.4.1), we use the following notation:

$$
e_t(\delta) := y_t - y_{t|t-1} \quad \text{and} \quad \Sigma_t(\delta) := \Sigma_y(t|t-1). \tag{18.4.2}
$$

This notation makes the dependence on $\delta$ explicit. Occasionally, we will, however, drop $\delta$. With this notation, the log-likelihood function can be written as

$$
\ln l(\delta) = -\frac{KT}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\left[\ln|\Sigma_t(\delta)| + e_t(\delta)'\Sigma_t(\delta)^{-1}e_t(\delta)\right]. \tag{18.4.3}
$$
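
To make the prediction error form of the log-likelihood concrete, the following minimal sketch evaluates it with the Kalman filter for a scalar toy model. The model (an AR(1) state observed with noise, $K = 1$), the function name `kalman_loglik`, and its argument names are assumptions made for illustration only; they are not taken from the text.

```python
import math

def kalman_loglik(y, b, var_w, var_v, mu0, var0):
    """Gaussian log-likelihood of a scalar state space model via the Kalman filter.

    Hypothetical toy model (an assumption for illustration, K = 1):
        z_t = b * z_{t-1} + w_{t-1},   w_t ~ N(0, var_w)
        y_t = z_t + v_t,               v_t ~ N(0, var_v)
        z_0 ~ N(mu0, var0)
    """
    z_filt, p_filt = mu0, var0            # z_{0|0} and Sigma_z(0|0)
    loglik = 0.0
    for obs in y:
        # prediction step
        z_pred = b * z_filt               # z_{t|t-1}
        p_pred = b * b * p_filt + var_w   # Sigma_z(t|t-1)
        s = p_pred + var_v                # Sigma_t(delta), a scalar here
        e = obs - z_pred                  # e_t(delta) = y_t - y_{t|t-1}
        # contribution of period t to the log-likelihood
        loglik += -0.5 * (math.log(2 * math.pi) + math.log(s) + e * e / s)
        # correction step
        gain = p_pred / s
        z_filt = z_pred + gain * e        # z_{t|t}
        p_filt = p_pred - gain * p_pred   # Sigma_z(t|t)
    return loglik
```

For two observations the value returned by this sketch can be compared with the bivariate normal density of $(y_1, y_2)$ evaluated directly, which is one way to check that the factorization into conditional densities has been coded correctly.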


18.4.2  The  Identification  Problem 

Recall from the discussion in Chapter 12 that unique maximization of the likelihood function and asymptotic inference require an identified or unique parameterization. Identification is not automatic in the present context because, for instance, VARMA models are not identified without specific restrictions and VARMA processes are just special cases of the presently considered models. Hence, the identification or uniqueness problem is inherent in the general linear state space model, too. We will state the problem here again in sufficient generality to cover the present case.

Let $y := \mathrm{vec}(y_1, \dots, y_T)$ be the vector of observed random variables and denote its distribution by $F(y; \delta_0)$, where $\delta_0$ is the true parameter vector. We assume that the true distribution of $y$ is a member of the parametric family

$$
\{F(y; \delta) \,|\, \delta \in D\},
$$

where $D \subset \mathbb{R}^n$ is the parameter space. The vector $\delta_0$ is said to be identified or identifiable if it is the only vector in $D$ which gives rise to the distribution of $y$. In other words, for any $\delta_1 \in D$,

$$
\delta_1 \neq \delta_0 \;\Rightarrow\; F(y; \delta_1) \neq F(y; \delta_0) \quad \text{(for at least one } y\text{)}. \tag{18.4.4}
$$

To compute ML estimators and to derive asymptotic properties it is actually sufficient that $\delta_0$ has a neighborhood in which it is uniquely determined by the true distribution of $y$. To distinguish this case from one where uniqueness follows for the whole parameter space, the vector $\delta_0$ or the model is often called locally identified or locally identifiable if there exists a neighborhood $U(\delta_0)$ of $\delta_0$ such that (18.4.4) holds for any $\delta_1 \in U(\delta_0)$. In contrast, the model or parameter vector is globally identifiable or globally identified if (18.4.4) holds for all $\delta_1 \in D$.

Because the negative log-likelihood function has a locally unique minimum if its Hessian matrix is positive definite, identification conditions for state space models may be formulated via the information matrix. If we are interested in asymptotic properties of estimators, it is sufficient to obtain identification in large samples. Hence, under some regularity conditions, the identification assumption may be disguised in the requirement of a positive definite asymptotic information matrix. In a later proposition giving asymptotic properties of the ML estimators, to ensure identification, we will include the condition that the sequence of normalized information matrices, $\mathcal{I}(\delta_0)/T$, is bounded from below by a positive definite matrix, as $T$ goes to infinity. In special case models, other identification conditions are often easier to deal with and are therefore preferred. For example, for VARMA processes the identification conditions given in Section 12.1.2 may be used.

18.4.3  Maximization  of  the  Log-Likelihood  Function 

From some previous chapters we know that maximization of the log-likelihood function is in general a nonlinear optimization problem. Therefore, numerical methods are required for its solution. One possibility is a gradient algorithm as described, for example, in Section 12.3.2 for iteratively minimizing $-\ln l$. Recall that the general form of the $i$-th iteration step is

$$
\delta_{i+1} = \delta_i - s_i D_i \left[\frac{\partial(-\ln l)}{\partial \delta}\bigg|_{\delta_i}\right], \tag{18.4.5}
$$

where $s_i$ is the step length and $D_i$ is a positive definite direction matrix. The inverse information matrix is one possible choice for this matrix. In that case, the method is called the scoring algorithm. We will provide the ingredients for this algorithm in the following, that is, we will give expressions for the gradient of $\ln l$ and an estimator of the information matrix. There are various ways to choose the step length $s_i$. For instance, it could be chosen so as to optimize the progress towards the minimum. Another alternative would be to simply set $s_i = 1$. We will not discuss the step length selection in further detail here because it is of limited importance for the statistical analysis of the model.
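
The mechanics of a scoring iteration can be sketched on a deliberately simple problem. The example below is a toy stand-in, not the state space likelihood: it estimates the variance of an i.i.d. $\mathcal{N}(0, \sigma^2)$ sample, for which the gradient of $-\ln l$ and the information are available in closed form. The function name `scoring_iterations` and its arguments are hypothetical.

```python
def scoring_iterations(y, sigma2_start, n_iter=5, step=1.0):
    """Scoring iterations for the variance of an i.i.d. N(0, sigma2) sample.

    Toy stand-in (an assumption for illustration): the objective is
        -ln l = T/2 ln(2 pi) + T/2 ln(sigma2) + sum(y_t^2) / (2 sigma2).
    Returns the sequence of iterates, starting with sigma2_start.
    """
    T = len(y)
    ss = sum(v * v for v in y)
    sigma2 = sigma2_start
    path = [sigma2]
    for _ in range(n_iter):
        grad = T / (2 * sigma2) - ss / (2 * sigma2 ** 2)  # gradient of -ln l
        info = T / (2 * sigma2 ** 2)                      # (1 x 1) information matrix
        direction = 1.0 / info                            # D_i = inverse information
        sigma2 = sigma2 - step * direction * grad         # iteration step
        path.append(sigma2)
    return path
```

With step length 1 this toy problem reaches the ML estimate, the mean of the squared observations, after a single step. For a state space model the gradient and the information matrix would instead have to be computed from the Kalman filter output, and more iterations are usually needed.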

The  Gradient  of  the  Log-Likelihood 

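From (18.4.3), we get

$$
\begin{aligned}
\frac{\partial \ln l}{\partial \delta'}
&= -\frac{1}{2}\sum_{t=1}^{T}\left[\mathrm{vec}\left(\frac{\partial \ln|\Sigma_t|}{\partial \Sigma_t}\right)'\frac{\partial\,\mathrm{vec}(\Sigma_t)}{\partial \delta'} + \frac{\partial\,\mathrm{tr}(e_t'\Sigma_t^{-1}e_t)}{\partial \delta'}\right] \\
&= -\frac{1}{2}\sum_{t=1}^{T}\left[\mathrm{vec}\left(\Sigma_t^{-1} - \Sigma_t^{-1}e_te_t'\Sigma_t^{-1}\right)'\frac{\partial\,\mathrm{vec}(\Sigma_t)}{\partial \delta'} + 2e_t'\Sigma_t^{-1}\frac{\partial e_t}{\partial \delta'}\right],
\end{aligned}
\tag{18.4.6}
$$

where $\partial e_t/\partial \delta'$ may be replaced by $-\partial y_{t|t-1}/\partial \delta'$.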


The  Information  Matrix 


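Using $E(e_te_t') = \Sigma_t$ $(= \Sigma_y(t|t-1))$ and $E[e_t(\delta_0)] = 0$, straightforward application of the rules for matrix and vector differentiation yields the information matrix,

$$
\mathcal{I}(\delta_0) = -E\left[\frac{\partial^2 \ln l}{\partial \delta\,\partial \delta'}\bigg|_{\delta_0}\right]
= \frac{1}{2}\sum_{t=1}^{T}\left\{E\left[\frac{\partial\,\mathrm{vec}(\Sigma_t)'}{\partial \delta}\left(\Sigma_t^{-1} \otimes \Sigma_t^{-1}\right)\frac{\partial\,\mathrm{vec}(\Sigma_t)}{\partial \delta'}\right] + 2E\left[\frac{\partial e_t'}{\partial \delta}\Sigma_t^{-1}\frac{\partial e_t}{\partial \delta'}\right]\right\}.
\tag{18.4.7}
$$

Because the true parameter values involved in this expression are unknown, they are replaced by estimators and the expectation is simply dropped. For instance, in the $i$-th iteration of the scoring algorithm, $\delta_i$ is used as an estimator for $\delta_0$.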


Discussion  of  the  Scoring  Algorithm 

The  scoring  algorithm  may  have  poor  convergence  properties  far  away  from 
the  maximum  of  the  log-likelihood  function.  On  the  other  hand,  it  has  very 
good  convergence  properties  close  to  the  maximum.  Unfortunately,  it  may  be 
expensive  in  terms  of  computation  time  because  it  requires  (possibly  numeri¬ 
cal)  evaluation  of  derivatives  in  each  iteration.  Therefore,  other  maximization 
methods  were  proposed  in  the  literature.  Notably  the  EM  (expectation  step- 
maximization  step)  algorithm  of  Dempster,  Laird  &  Rubin  (1977)  was  found 
to  be  useful  in  practice  (see  Watson  &  Engle  (1983),  Schneider  (1992)).  The 
EM  algorithm  is  an  iterative  algorithm  which  has  the  advantage  of  involving 
much  cheaper  computations  in  each  iteration  step  than  the  scoring  algorithm. 
On the other hand, convergence of the former is slower than that of the latter algorithm. Nicholls & Pagan (1985) and Schneider (1991, 1992) suggested combining the EM and the scoring algorithms. This proposal may be useful if no good initial estimator $\delta_1$ is available from where to start the scoring algorithm. Another alternative is to use the so-called subspace algorithm for getting initial values (e.g., Bauer & Wagner (2002)).

18.4.4  Asymptotic  Properties  of  the  ML  Estimator 

We consider the state space model from Section 18.3.1 with transition equation

$$
z_t = Bz_{t-1} + Fx_{t-1} + w_{t-1} \tag{18.4.8}
$$

and measurement equation

$$
y_t = H_tz_t + Gx_t + v_t. \tag{18.4.9}
$$

All assumptions of Section 18.3.1 are taken to be satisfied. In addition we assume that

(i) the true parameter vector is in the interior of the parameter space which is supposed to be compact;

(ii) $H_t = (x_t \otimes I)J$, where $J$ is a known selection matrix such as $J = [I_K : 0 : \cdots : 0]$ or $H_t = H$ is a time invariant nonstochastic matrix;

(iii) the inputs $x_t$ are nonstochastic and uniformly bounded, that is, there exist real numbers $c_1$ and $c_2$ such that $c_1 < x_t'x_t < c_2$ for all $t = 0, 1, 2, \dots$;

(iv) the sequence of normalized information matrices is bounded from below by a positive definite matrix, that is, there exists a constant $c$ such that $T^{-1}\mathcal{I}(\delta_0) > cI_n$ or, in other words, $T^{-1}\mathcal{I}(\delta_0) - cI_n$ is positive definite, as $T \to \infty$;

(v) all eigenvalues of $B$ have modulus less than 1.

As we have discussed in Section 18.4.2, (iv) is an identification condition. The last assumption is a stability condition, and (iii) guarantees that the input variables have no trends. We have seen in Chapter 7 that the standard asymptotic theory may not apply for trending variables. Therefore, they are excluded here. With these assumptions, the following proposition can be established.

Proposition 18.1 (Asymptotic Properties of the ML Estimator)

With all the assumptions stated in the foregoing, the ML estimator $\tilde{\delta}$ of $\delta_0$ is consistent and asymptotically normally distributed,

$$
\sqrt{T}(\tilde{\delta} - \delta_0) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\tilde{\delta}}), \tag{18.4.10}
$$

where

$$
\Sigma_{\tilde{\delta}} = \lim T\,\mathcal{I}(\delta_0)^{-1}
$$

is the inverse asymptotic information matrix. It is consistently estimated by substituting the ML estimators for unknown parameters in (18.4.7), dropping the expectation operator and dividing by $T$. ■

Pagan (1980) gives a proof of this proposition based on Crowder (1976) (see also Schneider (1988)). Other sets of conditions are possible to accommodate the situation where the inputs $x_t$ are stochastic. They may, in fact, contain lagged $y_t$'s. Moreover, $B$ may have eigenvalues on the unit circle if it does not contain unknown parameters. The reader is referred to the articles by Pagan (1980), Nicholls & Pagan (1985), Schneider (1988), and to a book by Caines (1988) for details. When particular models are considered, different sets of assumptions are often preferable for two reasons. First, other sets of conditions may be easier to verify or to understand for special models. Second, the conditions of Proposition 18.1 or the modifications mentioned in the foregoing may not be satisfied. We will see an example of the latter case shortly.

A  number  of  alternatives  to  ML  estimation  were  suggested,  see,  e.g.,  An¬ 
derson  &  Moore  (1979),  Nicholls  &  Pagan  (1985),  Schneider  (1988),  and  Bauer 
&  Wagner  (2002)  for  more  details  and  references. 

It  may  be  worth  noting  that  application  of  the  Kalman  filter  to  systems 
with  estimated  parameters  produces  state  estimates  and  precision  matrices 
that  do  not  take  into  account  the  estimation  variability.  Watanabe  (1985) 
and  Hamilton  (1986)  considered  the  properties  of  state  estimators  obtained 
with  estimated  parameter  Kalman  filter  recursions.  Furthermore,  a  state  space 
framework  for  unit  root  processes  was  presented  by  Bauer  &  Wagner  (2003). 

18.5  A  Real  Data  Example 

As an illustrative example, we consider a dynamic consumption function with time varying coefficients,

$$
\begin{aligned}
y_t &= \gamma_{0t} + \gamma_{1t}x_t + \gamma_{2t}x_{t-1} + \gamma_{3t}y_{t-1} + \gamma_{4t}x_{t-2} + \gamma_{5t}y_{t-2} + v_t \\
&= X_t'\gamma_t + v_t,
\end{aligned}
\tag{18.5.1}
$$

where

$$
X_t := (1, x_t, x_{t-1}, y_{t-1}, x_{t-2}, y_{t-2})' \quad \text{and} \quad \gamma_t := (\gamma_{0t}, \gamma_{1t}, \dots, \gamma_{5t})'.
$$

Here $y_t$ and $x_t$ represent rates of change (first differences of logarithms) of consumption and income, respectively. Suppose that the coefficient vector $\gamma_t$ differs from $\gamma_{t-1}$ by an additive random disturbance, that is,

$$
\gamma_t = \gamma_{t-1} + w_{t-1}. \tag{18.5.2}
$$

In other words, it is driven by a (multivariate) random walk. Clearly, (18.5.1) and (18.5.2) represent the measurement and transition equations of a state space model with $H_t = X_t'$ and $B = I_6$. We complete the model by assuming that $v_t$ and $w_t$ are independent Gaussian white noise processes, $v_t \sim \mathcal{N}(0, \sigma_v^2)$ and $w_t \sim \mathcal{N}(0, \Sigma_w)$, where

$$
\Sigma_w = \begin{bmatrix} \sigma_{w_0}^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_{w_5}^2 \end{bmatrix} \tag{18.5.3}
$$

is a diagonal matrix. Furthermore, the initial state $\gamma_0$ is also assumed to be normally distributed, $\gamma_0 \sim \mathcal{N}(\bar{\gamma}_0, \Sigma_0)$, and independent of $v_t$ and $w_t$. Admittedly, our assumed model is quite simple. Still, it is useful to illustrate some concepts considered in the previous sections.

Assuming that a sample $y = (y_1, \dots, y_T)'$ is available, the log-likelihood function of our model is

$$
\ln l(\sigma_v^2, \Sigma_w, \bar{\gamma}_0, \Sigma_0|y)
= -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln|\Sigma_y(t|t-1)| - \frac{1}{2}\sum_{t=1}^{T}(y_t - y_{t|t-1})^2/\Sigma_y(t|t-1),
\tag{18.5.4}
$$

where $\Sigma_y(t|t-1)$ is a scalar ($(1 \times 1)$ matrix) because $y_t$ is a univariate variable. The log-likelihood function may be evaluated with the Kalman filter recursions for given parameters $\sigma_v^2$, $\Sigma_w$, $\bar{\gamma}_0$, and $\Sigma_0$. The maximization problem may be solved with an iterative algorithm. Once estimates of the parameters $\sigma_v^2$ and $\Sigma_w$ are available, estimates $\gamma_{t|T}$ of the coefficients of the consumption function (18.5.1) may be obtained with the smoothing recursions given in Section 18.3.1.
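
The evaluation of (18.5.4) for given parameter values can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the code used for the estimates in the text: the function name `tvc_loglik`, the argument layout (lists of rows, the diagonal of $\Sigma_w$ passed as a list), and the returned filtered coefficients are all choices made here for illustration.

```python
import math

def tvc_loglik(y, X, sigma2_v, sigma2_w, gamma0, P0):
    """Log-likelihood (18.5.4) for y_t = X_t' gamma_t + v_t, gamma_t = gamma_{t-1} + w_{t-1}.

    y        : list of scalar observations
    X        : list of regressor lists X_t
    sigma2_v : measurement noise variance
    sigma2_w : list with the diagonal elements of Sigma_w
    gamma0   : mean of the initial state, P0 its covariance matrix (list of rows)
    Returns the log-likelihood and the filtered coefficients gamma_{t|t}.
    """
    n = len(gamma0)
    g = list(gamma0)
    P = [row[:] for row in P0]
    loglik = 0.0
    filtered = []
    for yt, xt in zip(y, X):
        # prediction step with B = I: the state mean is unchanged, Sigma_w is added
        for i in range(n):
            P[i][i] += sigma2_w[i]
        Px = [sum(P[i][j] * xt[j] for j in range(n)) for i in range(n)]
        s = sum(xt[i] * Px[i] for i in range(n)) + sigma2_v   # Sigma_y(t|t-1)
        e = yt - sum(xt[i] * g[i] for i in range(n))          # y_t - y_{t|t-1}
        loglik += -0.5 * (math.log(2 * math.pi) + math.log(s) + e * e / s)
        # correction step (P is symmetric, so P x x' P = Px Px')
        g = [g[i] + Px[i] * e / s for i in range(n)]
        P = [[P[i][j] - Px[i] * Px[j] / s for j in range(n)] for i in range(n)]
        filtered.append(list(g))
    return loglik, filtered
```

Setting all elements of `sigma2_w` to zero turns the recursion into recursive least squares, so with a very diffuse prior the last filtered coefficient vector is close to the OLS estimate. The smoothed estimates $\gamma_{t|T}$ discussed in the text require the additional smoothing recursions, which this sketch does not implement.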

Using first differences of logarithms of the quarterly consumption and income data given in File E1 for the years 1960 to 1982, we have estimated the parameters of the state space model (18.5.1)/(18.5.2). The ML estimates of the parameters of interest, namely the variances $\sigma_v^2$ and $\sigma_{w_i}^2$, $i = 0, 1, \dots, 5$, together with estimated standard errors (square roots of the diagonal elements of the estimated inverse information matrix) and corresponding $t$-ratios are given in Table 18.1.

The interpretation of the standard errors and $t$-ratios needs caution for various reasons. In Proposition 18.1, where the asymptotic distribution of the ML estimators is given, we have assumed that all eigenvalues of the transition matrix $B$ have modulus less than 1. This condition is clearly not satisfied in the present example, where $B = I_6$ and, thus, all six eigenvalues are equal to 1. However, as mentioned in Section 18.4.4, the condition on the eigenvalues of $B$ is not crucial if $B$ is a known matrix which does not contain unknown parameters. Of course, setting $B = I_6$ is just an assumption which may or may not be adequate.

Table 18.1. ML estimates for the example model

parameter           estimate                  standard error            t-ratio
$\sigma_v^2$        $3.91 \times 10^{-5}$     $1.99 \times 10^{-5}$     1.97
$\sigma_{w_0}^2$    $2.04 \times 10^{-5}$     $2.33 \times 10^{-5}$     .88
$\sigma_{w_1}^2$    $.14 \times 10^{-2}$      $1.08 \times 10^{-2}$     .13
$\sigma_{w_2}^2$    $.46 \times 10^{-2}$      $.92 \times 10^{-2}$      .50
$\sigma_{w_3}^2$    $.45 \times 10^{-2}$      $1.11 \times 10^{-2}$     .41
$\sigma_{w_4}^2$    $.51 \times 10^{-2}$      $.94 \times 10^{-2}$      .54
$\sigma_{w_5}^2$    $.62 \times 10^{-2}$      $1.16 \times 10^{-2}$     .54

A further deviation from the assumptions of Proposition 18.1 is that the inputs $X_t$ contain lagged endogenous variables and hence are stochastic. Again, we have mentioned in Section 18.4.4 that this assumption is not necessarily critical. The conditions of Proposition 18.1 could be modified so as to allow for lagged dependent variables.

Another assumption that may be problematic is the normality of the white noise sequences and the initial state. The normality assumption may be checked by computing the skewness and kurtosis of the standardized quantities $(y_t - y_{t|t-1})/\Sigma_y(t|t-1)^{1/2}$. A test for nonnormality may then be based on the $\chi^2$-statistic involving both skewness and kurtosis as described in Chapter 4, Section 4.5. For the present example, the statistic assumes the value 3.00 and has a $\chi^2(2)$-distribution under the null hypothesis of normality. Thus, it is not significant at any conventional level.
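
A skewness and kurtosis check of this kind can be sketched in a few lines. The statistic coded below, $T b_1^2/6 + T(b_2 - 3)^2/24$ with sample skewness $b_1$ and sample kurtosis $b_2$, is the usual univariate form and is assumed here to correspond to the test referred to in the text; the function names are hypothetical, and the input series is centred and scaled by its own sample moments.

```python
import math

def skew_kurt_statistic(u):
    """Skewness-kurtosis statistic T*b1^2/6 + T*(b2 - 3)^2/24 for a univariate series u."""
    T = len(u)
    mean = sum(u) / T
    m2 = sum((v - mean) ** 2 for v in u) / T
    m3 = sum((v - mean) ** 3 for v in u) / T
    m4 = sum((v - mean) ** 4 for v in u) / T
    b1 = m3 / m2 ** 1.5   # sample skewness
    b2 = m4 / m2 ** 2     # sample kurtosis
    return T * b1 ** 2 / 6 + T * (b2 - 3) ** 2 / 24

def chi2_2_pvalue(stat):
    """Upper tail probability of a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)
```

Because the upper tail probability of a $\chi^2(2)$ variable at $x$ is $e^{-x/2}$, the value 3.00 reported for the example corresponds to a $p$-value of about 0.22.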

Finally, we have assumed in Proposition 18.1 that the true parameter values lie in the interior of the parameter space. Given that the variance estimates are quite small compared to their estimated standard errors, it is possible that at least the $\sigma_{w_i}^2$ are in fact zero and, thus, lie on the boundary of the feasible parameter space. If the $\sigma_{w_i}^2$ are actually zero, the $\gamma_t$ are time invariant in our model which would be a hypothesis of considerable interest. It would permit us to work with a constant coefficient specification. Unfortunately, if $\sigma_{w_i}^2 = 0$, the corresponding $t$-ratio does not have an asymptotic standard normal distribution in general. Thus, we cannot use the $t$-ratios given in Table 18.1 for testing the null hypotheses $\sigma_{w_i}^2 = 0$, $i = 0, 1, \dots, 5$.

In the present context, we may ignore the problems related to the asymptotic theory for the moment and simply regard the model as a descriptive tool. Using the estimated values of the parameters of the model, we may consider the smoothing estimates $\gamma_{t|T}$ of the states (the coefficients of the consumption function). They are plotted in Figure 18.2. The two-standard error bounds which are also shown in the figure are computed from the $\Sigma_\gamma(t|T)$. These quantities are obtained with the smoothing recursions given in Section 18.3.1. From the plots in Figure 18.2, it can be seen that the intercept term $\gamma_{0t}$ is the only coefficient that exhibits substantial variation through time. For instance, a considerable downturn is observed in 1966/1967, when the West German economy was in a recession. All the other coefficients show relatively little variation through time, although $\gamma_{3t}$, $\gamma_{4t}$, and $\gamma_{5t}$ (the coefficients of $y_{t-1}$, $x_{t-2}$, and $y_{t-2}$, respectively) have a tendency to decline in the second half of the 1970s. However, given the estimated two-standard error bounds, overall the results support a specification with constant coefficients of current and lagged income and lagged consumption.

Fig. 18.2. Smoothing estimates of the consumption function coefficients (coefficient estimates with estimated two-standard error bounds).

As  mentioned  previously,  this  example  is  quite  simplistic.  It  is  just  meant 
to  illustrate  some  of  the  concepts  discussed  in  this  chapter.  In  a  more  general 
model,  other  lags  of  income  and/or  consumption  could  appear  on  the  right- 
hand  side  of  the  consumption  function,  the  coefficients  could  be  generated 
by a more general VAR model and the covariance matrix of $w_t$ could be
nondiagonal.  Moreover,  the  consumption  function  may  just  be  a  part  of  a 
system  of  equations. 

Given  that  we  have  discussed  different  models  for  the  same  data  in  previous 
chapters,  the  example  also  illustrates  that  there  is  not  just  one  possible  model 
or  model  class  for  the  generation  process  of  a  multiple  time  series.  The  reader 
may  wonder  which  of  the  models  we  have  considered  in  this  and  the  previous 
chapters  is  “best”.  That,  however,  depends  on  the  questions  of  interest.  In 
other  words,  the  time  series  analyst  has  to  decide  on  the  model  with  the 
objective  of  his  or  her  analysis  in  mind.  In  this  book,  we  have  just  tried  to 
introduce some of the possible tools in this venture. With these tools in hand, we hope the analyst will be able to approach his or her problems of interest in a superior way, with an improved sense of the available possibilities and the potential pitfalls.


18.6  Exercises 

Problem  18.1 

Write  the  VARMAX  model 

$$
y_t = A_1y_{t-1} + \cdots + A_py_{t-p} + B_0x_t + \cdots + B_sx_{t-s} + u_t + M_1u_{t-1} + \cdots + M_qu_{t-q}
$$

in  state  space  form. 

Problem 18.2

Suppose that in the dynamic factor analytic model, $y_t = Lf_t + u_t$, the common factors $f_t$ are generated by the VARMA($p$, $q$) process,

$$
f_t = A_1f_{t-1} + \cdots + A_pf_{t-p} + \eta_t + M_1\eta_{t-1} + \cdots + M_q\eta_{t-q},
$$

and the individual factors $u_t$ are generated by the VARMA($r$, $s$) process

$$
u_t = C_1u_{t-1} + \cdots + C_ru_{t-r} + \varepsilon_t + D_1\varepsilon_{t-1} + \cdots + D_s\varepsilon_{t-s}.
$$

Write the model in state space form.

Problem 18.3

Assume that $y_t$ is generated according to

$$
y_t = A_tY_{t-1} + C_tx_t + v_t,
$$

where $A_t := [A_{1t}, \dots, A_{pt}]$ and $Y_{t-1} := [y_{t-1}', \dots, y_{t-p}']'$. Suppose that $\alpha_t := \mathrm{vec}[A_t : C_t]$ is driven by the VARMA($r$, $s$) process

$$
\alpha_t = D_1\alpha_{t-1} + \cdots + D_r\alpha_{t-r} + \eta_t + M_1\eta_{t-1} + \cdots + M_s\eta_{t-s}.
$$

Write the model in state space form.

Problem 18.4

Write down explicitly the first two steps of the Kalman filter recursions.

Problem 18.5

Suppose the scalar observable variable $y_t$ is generated by the random coefficient regression model

$$
y_t = \nu + x_t\beta_t + v_t, \quad t = 1, \dots, T,
$$

where $\beta_t = \alpha\beta_{t-1} + w_t$ is driven by an AR(1) process. Suppose further that $v_t$ and $w_t$ are independent zero mean Gaussian white noise processes with variance 1 and let $\beta_0$ be a standard normal random variable.

(a) Determine the conditional distribution of $\beta_t$ given $y_1, \dots, y_{t-1}$.

(b) Write down the log-likelihood function of the model and derive its gradient. Find an expression for the information matrix.

Problem  18.6 

Consider the $K$-dimensional Gaussian stable VAR(1) process $y_t = Ay_{t-1} + u_t$ with $y_0 \sim \mathcal{N}(0, \Sigma_0)$ and $u_t \sim \mathcal{N}(0, \Sigma_u)$ for $t = 1, 2, \dots$. Use the Kalman filter recursions to determine $y_{t|t-1}$.

(a) Show that $y_{t|t-1} = Ay_{t-1}$.

(b) Show that the conditions of Proposition 18.1 are satisfied if

$$
\delta = \begin{bmatrix} \mathrm{vec}(A) \\ \mathrm{vech}(\Sigma_u) \end{bmatrix}.
$$

Problem  18.7 

Repeat  the  analysis  of  Section  18.5  with  the  same  data  and  the  state  space 
model  consisting  of  the  measurement  equation 


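$$
y_t = \gamma_{0t} + \gamma_{1t}x_t + \gamma_{2t}y_{t-1} + v_t
$$

and the transition equation

$$
\begin{bmatrix} \gamma_{0,t} \\ \gamma_{1,t} \\ \gamma_{2,t} \end{bmatrix}
= \begin{bmatrix} \gamma_{0,t-1} \\ \gamma_{1,t-1} \\ \gamma_{2,t-1} \end{bmatrix} + w_{t-1}.
$$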


Appendices 


A 


Vectors  and  Matrices 


The  following  summary  of  matrix  and  vector  algebra  is  not  meant  to  be  an 
introduction  to  the  subject  but  is  just  a  brief  review  of  terms  and  rules  used  in 
the  text.  Most  of  them  can  be  found  in  books  such  as  Graybill  (1969),  Searle 
(1982),  Anderson  (1984,  Appendix),  Magnus  &  Neudecker  (1988),  Magnus 
(1988) or Lütkepohl (1996a). Therefore proofs or further references are only
provided  in  exceptional  cases. 


A.1 Basic Definitions


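A matrix is a rectangular array of numbers. For instance,

$$
\begin{bmatrix} 3 & 5 & .3 & .3 \\ 2 & 2 & 2 & 2 \end{bmatrix}, \quad
(0, 1, 0), \quad
\begin{bmatrix} 3 & -5 \\ .3 & 0 \end{bmatrix}
$$

are matrices. More generally,

$$
A = (a_{ij}) = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \dots & a_{mn} \end{bmatrix} \tag{A.1.1}
$$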


is a matrix with $m$ rows and $n$ columns. Such a matrix is briefly called an $(m \times n)$ matrix, $m$ being the row dimension and $n$ being the column dimension. The numbers $a_{ij}$ are the elements or components of $A$. In the following, it is assumed that the elements of all matrices considered are real numbers unless otherwise stated. In other words, we will be concerned with real rather than complex matrices. If the dimensions $m$ and $n$ are clear from the context or if they are of no importance, the notation $A = (a_{ij})$ means that $a_{ij}$ is a typical element of $A$, that is, $A$ consists of elements $a_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$.

A $(1 \times n)$ matrix is a row vector and an $(m \times 1)$ matrix is a column vector which is often denoted by a lower case letter in the text. If not otherwise noted, all vectors will be column vectors in the following. Instead of $(m \times 1)$ matrix we sometimes say $(m \times 1)$ vector or simply $m$-vector or $m$-dimensional vector.

An $(m \times m)$ matrix with the number of rows equal to the number of columns is a square matrix. An $(m \times m)$ square matrix

$$
\begin{bmatrix} a_{11} & 0 & \dots & 0 \\ 0 & a_{22} & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & a_{mm} \end{bmatrix}
$$

with zeros off the main diagonal is a diagonal matrix. If all the diagonal elements of a diagonal matrix are one, it is an identity or unit matrix. An $(m \times m)$ identity matrix is denoted by $I_m$ or simply by $I$ if the dimension is unimportant or obvious from the context. A square matrix with all elements below (above) the main diagonal being zero is called upper (lower) triangular or simply triangular matrix. A matrix consisting of zeros only is a null matrix or zero matrix. Usually, in this text, such a matrix is simply denoted by 0 and its dimensions have to be figured out from the context.

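The transpose of the $(m \times n)$ matrix $A$ given in (A.1.1) is the $(n \times m)$ matrix

$$
A' = \begin{bmatrix} a_{11} & \dots & a_{m1} \\ \vdots & \ddots & \vdots \\ a_{1n} & \dots & a_{mn} \end{bmatrix},
$$

the $n$ rows of $A'$ being the $n$ columns of $A$. The matrix $A$ is symmetric if $A' = A$. For instance,

$$
\begin{bmatrix} 3 & 0 \\ 3 & 1 \\ 1 & 0 \end{bmatrix}
\quad \text{is the transpose of} \quad
\begin{bmatrix} 3 & 3 & 1 \\ 0 & 1 & 0 \end{bmatrix}
$$

and

$$
\begin{bmatrix} 2 & -1 \\ -1 & 0 \end{bmatrix}
$$

is a symmetric matrix.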


A.2 Basic Matrix Operations

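Let $A = (a_{ij})$ and $B = (b_{ij})$ be $(m \times n)$ matrices. The two matrices are equal, $A = B$, if $a_{ij} = b_{ij}$ for all $i, j$. The following matrix operations are basic:

$$
\begin{aligned}
A + B &:= (a_{ij} + b_{ij}) && \text{(addition)}, \\
A - B &:= (a_{ij} - b_{ij}) && \text{(subtraction)}.
\end{aligned}
$$

For a real constant $c$,

$$
cA = Ac := (ca_{ij}) \quad \text{(multiplication by a scalar)}.
$$

Let $C = (c_{ij})$ be an $(n \times r)$ matrix, then the product

$$
AC := \left(\sum_{j=1}^{n} a_{ij}c_{jk}\right) \quad \text{(multiplication)}
$$

is an $(m \times r)$ matrix. For instance,

$$
\begin{bmatrix} 3 & 3 \\ 2 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 3 & 0 \\ 2 & 4 & -1 \end{bmatrix}
= \begin{bmatrix} 3 \cdot 2 + 3 \cdot 2 & 3 \cdot 3 + 3 \cdot 4 & 3 \cdot 0 - 3 \cdot 1 \\ 2 \cdot 2 + 1 \cdot 2 & 2 \cdot 3 + 1 \cdot 4 & 2 \cdot 0 - 1 \cdot 1 \end{bmatrix}
= \begin{bmatrix} 12 & 21 & -3 \\ 6 & 10 & -1 \end{bmatrix}.
$$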

If  the  column  dimension  of  A  is  the  same  as  the  row  dimension  of  C  so  that 
A  and  C  can  be  multiplied,  the  two  matrices  are  conformable.  In  the  product 
AC  the  matrix  C  is  premultiplied  by  A  and  A  is  postmultiplied  by  C. 

Rules: Suppose $A$, $B$, and $C$ are matrices with suitable dimensions so that
the  following  operations  are  defined  and  c  is  a  scalar. 

(1)  A  +  B  =  B  +  A. 

(2)  (A  +  B)  +  C  =  A+(B  +  C). 

(3)  A(B  +  C)  =  AB  +  AC. 

(4)  c(A  +  B)=cA  +  cB. 

(5) $AB \neq BA$ in general.

(6)  (AB)C  =  A(BC). 

(7)  (AB)'  =  B'A'. 

(8)  AI  =  IA  =  A. 

(9)  AA'  and  A' A  are  symmetric  matrices. 
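
The product defined above and some of the rules can be checked numerically. The following is a small pure-Python sketch in which matrices are stored as lists of rows; the helper names `transpose` and `matmul` are chosen here for illustration.

```python
def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, c):
    """Product of an (m x n) matrix a and an (n x r) matrix c."""
    if len(a[0]) != len(c):
        raise ValueError("matrices are not conformable")
    return [[sum(a[i][j] * c[j][k] for j in range(len(c)))
             for k in range(len(c[0]))]
            for i in range(len(a))]

A = [[3, 3], [2, 1]]
C = [[2, 3, 0], [2, 4, -1]]
AC = matmul(A, C)                      # the (2 x 3) product worked out in the text
AAt = matmul(A, transpose(A))          # symmetric by rule (9)
```

Here `transpose(AC)` coincides with `matmul(transpose(C), transpose(A))`, in line with rule (7), whereas `matmul(C, A)` raises an error because $C$ and $A$ are not conformable in that order.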


A.3 The Determinant

The determinant of an $(m \times m)$ square matrix $A = (a_{ij})$ is the sum of all products

$$
a_{1i_1}a_{2i_2} \cdots a_{mi_m}
$$

consisting of precisely one element from each row and each column multiplied by $-1$ or $1$, depending on the permutation $i_1, \dots, i_m$ of the subscripts. The $-1$ is used if the number of inversions of $i_1, \dots, i_m$ to obtain the order $1, 2, \dots, m$ is odd and $1$ is used otherwise. The sum is taken over all $m!$ permutations of the column subscripts.

For a $(1 \times 1)$ matrix the determinant equals the value of the single element and for $m > 1$ the determinant may be defined recursively as follows. Suppose

$$
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
$$

is a $(2 \times 2)$ matrix. Then the determinant is

$$
\det(A) = |A| = a_{11}a_{22} - a_{12}a_{21}. \tag{A.3.1}
$$

For  instance, 


To specify the determinant of a general $(m \times m)$ matrix $A = (a_{ij})$ we define the minor of the $ij$-th element as the determinant of the $((m-1) \times (m-1))$ matrix that is obtained by deleting the $i$-th row and $j$-th column from $A$. The cofactor of $a_{ij}$, denoted by $A_{ij}$, is the minor multiplied by $(-1)^{i+j}$. Now

$$
\det(A) = |A| = a_{i1}A_{i1} + \cdots + a_{im}A_{im} = a_{1j}A_{1j} + \cdots + a_{mj}A_{mj} \tag{A.3.2}
$$

for any $i$ or $j \in \{1, \dots, m\}$. It does not matter which row or column is chosen in (A.3.2) because the determinant of a matrix is a unique number.

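For example, for the (3 × 3) matrix

\[ A = \begin{bmatrix} 2 & 1 & 3 \\ 0 & 2 & 1 \\ 1 & -1 & 4 \end{bmatrix} \tag{A.3.3} \]

the minor of the upper right-hand corner element is

\[ \det \begin{bmatrix} 0 & 2 \\ 1 & -1 \end{bmatrix} = -2. \]

The cofactor is also −2 because (−1)^{1+3} = 1. Developing by the first row
gives

\[ |A| = 2 \cdot \det \begin{bmatrix} 2 & 1 \\ -1 & 4 \end{bmatrix} - 1 \cdot \det \begin{bmatrix} 0 & 1 \\ 1 & 4 \end{bmatrix} + 3 \cdot \det \begin{bmatrix} 0 & 2 \\ 1 & -1 \end{bmatrix} = 13. \]

The same result is obtained by developing by any other row or column, e.g.,
using the first column gives

\[ |A| = 2 \cdot \det \begin{bmatrix} 2 & 1 \\ -1 & 4 \end{bmatrix} - 0 \cdot \det \begin{bmatrix} 1 & 3 \\ -1 & 4 \end{bmatrix} + 1 \cdot \det \begin{bmatrix} 1 & 3 \\ 2 & 1 \end{bmatrix} = 13. \]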
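The cofactor expansion in (A.3.2) can be sketched as a short recursive routine. The Python code below is only an illustration for small matrices; the helper names `minor` and `det` are assumptions made for this example, and the recursion is far too slow for large m.

```python
def minor(a, i, j):
    """Matrix left after deleting row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


# The matrix from (A.3.3).
A = [[2, 1, 3],
     [0, 2, 1],
     [1, -1, 4]]

det_row = det(A)
# Expanding along the first column instead must give the same number.
det_col = sum((-1) ** i * A[i][0] * det(minor(A, i, 0)) for i in range(3))
print(det_row, det_col)
```

Both expansions return 13, in line with the remark that the choice of row or column does not matter.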
Rules: In the following rules, A = (a_ij) and B = (b_ij) are (m × m) matrices
and c is a scalar.

(1) det(I_m) = 1.

(2) If A is a diagonal matrix, det(A) = a_11 · a_22 ··· a_mm.

(3) If A is a lower or upper triangular matrix, |A| = a_11 ··· a_mm.

(4) If A contains a row or column of zeros, |A| = 0.

(5) If B is obtained from A by adding to one row (column) a scalar multiple
of another row (column), then |A| = |B|.

(6) If A has two identical rows or columns, then |A| = 0.

(7) det(cA) = c^m det(A).

(8) |AB| = |A||B|.

(9) If C is an (m × n) matrix, det(I_m + CC') = det(I_n + C'C).


A.4 The Inverse, the Adjoint, and Generalized Inverses


A.4.1 Inverse and Adjoint of a Square Matrix


An (m × m) square matrix A is nonsingular or regular or invertible if there
exists a unique (m × m) matrix B such that AB = I_m. The matrix B is
denoted by A^{-1}. It is the inverse of A,

\[ A A^{-1} = A^{-1} A = I_m. \]

For m > 1, the (m × m) matrix of cofactors,

\[ A^{adj} = \begin{bmatrix} A_{11} & \cdots & A_{1m} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mm} \end{bmatrix}', \]

is the adjoint of A. For a (1 × 1) matrix A, we define the adjoint to be 1, that
is, A^{adj} = 1. To compute the inverse of the (m × m) matrix A, the relation

\[ A^{-1} = |A|^{-1} A^{adj} \tag{A.4.1} \]

is sometimes useful. For this expression to be meaningful, |A| has to be
nonzero. Indeed, A is nonsingular if and only if det(A) ≠ 0.

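As an example consider the matrix given in (A.3.3). Its adjoint is

\[ A^{adj} = \begin{bmatrix}
\begin{vmatrix} 2 & 1 \\ -1 & 4 \end{vmatrix} & -\begin{vmatrix} 0 & 1 \\ 1 & 4 \end{vmatrix} & \begin{vmatrix} 0 & 2 \\ 1 & -1 \end{vmatrix} \\[2ex]
-\begin{vmatrix} 1 & 3 \\ -1 & 4 \end{vmatrix} & \begin{vmatrix} 2 & 3 \\ 1 & 4 \end{vmatrix} & -\begin{vmatrix} 2 & 1 \\ 1 & -1 \end{vmatrix} \\[2ex]
\begin{vmatrix} 1 & 3 \\ 2 & 1 \end{vmatrix} & -\begin{vmatrix} 2 & 3 \\ 0 & 1 \end{vmatrix} & \begin{vmatrix} 2 & 1 \\ 0 & 2 \end{vmatrix}
\end{bmatrix}'
= \begin{bmatrix} 9 & -7 & -5 \\ 1 & 5 & -2 \\ -2 & 3 & 4 \end{bmatrix}. \]

Consequently,

\[ A^{-1} = \frac{1}{13} \begin{bmatrix} 9 & -7 & -5 \\ 1 & 5 & -2 \\ -2 & 3 & 4 \end{bmatrix}. \]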

Multiplying  this  matrix  by  A  is  easily  seen  to  result  in  the  (3  x  3)  identity 
matrix. 
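The relation (A.4.1) can be checked mechanically for the matrix in (A.3.3). The following Python sketch is only illustrative; the function names are assumptions made for this example, and exact fractions are used so that the product with A is exactly the identity matrix.

```python
from fractions import Fraction


def minor(a, i, j):
    """Matrix left after deleting row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def adjoint(a):
    """Transpose of the matrix of cofactors."""
    m = len(a)
    if m == 1:
        return [[1]]
    cof = [[(-1) ** (i + j) * det(minor(a, i, j)) for j in range(m)]
           for i in range(m)]
    return [list(col) for col in zip(*cof)]


def inverse(a):
    """Inverse via A^{-1} = |A|^{-1} A^{adj}; requires det(A) != 0."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(x, d) for x in row] for row in adjoint(a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


A = [[2, 1, 3],
     [0, 2, 1],
     [1, -1, 4]]

A_adj = adjoint(A)
A_inv = inverse(A)
product = matmul(A, A_inv)
print(A_adj)
print(product)
```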

Rules:

(1) For an (m × m) square matrix A, A A^{adj} = A^{adj} A = |A| I_m.

(2) An (m × m) matrix A is nonsingular if and only if det(A) ≠ 0.

In the following, A = (a_ij) and B are nonsingular (m × m) matrices and
c ≠ 0 is a scalar constant.

(3) A^{-1} = A^{adj} / |A|.

(4) (A')^{-1} = (A^{-1})'.

(5) (AB)^{-1} = B^{-1} A^{-1}.

(6) (cA)^{-1} = (1/c) A^{-1}.

(7) I_m^{-1} = I_m.

(8) If A is a diagonal matrix, then A^{-1} is also diagonal with diagonal elements
1/a_ii.

(9) For an (m × n) matrix C, (I_m + CC')^{-1} = I_m − C(I_n + C'C)^{-1}C'.


A.4.2 Generalized Inverses

Let A be an (m × n) matrix. Any matrix B satisfying ABA = A is a generalized
inverse of A. For example, if

\[ A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \]

the  following  matrices  are  generalized  inverses  of  A: 


10" 

"10" 

"10" 

0  1 

7 

0  0 

7 

.0  1. 

Obviously, a generalized inverse is not unique in general. An (n × m) matrix B
is called Moore-Penrose (generalized) inverse of A if it satisfies the following
four conditions:

\[ ABA = A, \qquad BAB = B, \qquad (AB)' = AB, \qquad (BA)' = BA. \tag{A.4.2} \]

The Moore-Penrose inverse of A is denoted by A^+. It exists for any (m × n)
matrix and is unique.
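The difference between an arbitrary generalized inverse and the Moore-Penrose inverse can be made concrete by checking the conditions in (A.4.2) one by one. The Python sketch below is illustrative only; the function names and the two candidate matrices `B1` and `B2` are assumptions chosen for this example.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def is_generalized_inverse(a, b):
    """Check the single condition ABA = A."""
    return matmul(matmul(a, b), a) == a


def is_moore_penrose(a, b):
    """Check all four conditions in (A.4.2)."""
    ab, ba = matmul(a, b), matmul(b, a)
    return (matmul(ab, a) == a          # ABA = A
            and matmul(ba, b) == b      # BAB = B
            and transpose(ab) == ab     # (AB)' = AB
            and transpose(ba) == ba)    # (BA)' = BA


A = [[1, 0], [0, 0]]
B1 = [[1, 0], [0, 1]]   # identity matrix
B2 = [[1, 0], [0, 0]]   # A itself

print(is_generalized_inverse(A, B1), is_moore_penrose(A, B1))
print(is_generalized_inverse(A, B2), is_moore_penrose(A, B2))
```

Both candidates satisfy ABA = A, but only `B2` passes all four conditions, which illustrates that the extra conditions are what single out a unique matrix.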

Rules: (See Magnus & Neudecker (1988, p. 33, Theorem 5).)

(1) A^+ = A^{-1} if A is nonsingular.

(2) (A^+)^+ = A.

(3) (A')^+ = (A^+)'.

(4) A'AA^+ = A^+AA' = A'.

(5) A'A^+'A^+ = A^+A^+'A' = A^+.

(6) (A'A)^+ = A^+A^+', (AA')^+ = A^+'A^+.

(7) A^+ = (A'A)^+A' = A'(AA')^+.


A.5 The Rank

Let x_1, …, x_n be (m × 1) vectors. They are linearly independent if, for the
constants c_1, …, c_n,

\[ c_1 x_1 + \cdots + c_n x_n = 0 \]

implies c_1 = ··· = c_n = 0. Equivalently, defining the (n × 1) vector c =
(c_1, …, c_n)' and the (m × n) matrix X = (x_1, …, x_n), the columns of X are
linearly independent if Xc = 0 implies c = 0. The columns of X are linearly
dependent if c_1 x_1 + ··· + c_n x_n = 0 holds with at least one c_i ≠ 0. In that case,

\[ x_i = d_1 x_1 + \cdots + d_{i-1} x_{i-1} + d_{i+1} x_{i+1} + \cdots + d_n x_n, \]

where d_j = −c_j / c_i. In other words, x_1, …, x_n are linearly dependent if at
least one of the vectors is a linear combination of the other vectors.

If n > m, the columns of X are linearly dependent. Consequently, if
x_1, …, x_n are linearly independent, then n ≤ m.

Let a_1, …, a_n be the columns of the (m × n) matrix A = (a_1, …, a_n).
That is, the a_i are (m × 1) vectors. The rank of A, briefly rk(A), is the
maximum number of linearly independent columns of A. Thus, if n ≤ m and
the a_1, …, a_n are linearly independent, rk(A) = n. The maximum number of
linearly independent columns of A equals the maximum number of linearly
independent rows. Hence, the rank may be defined equivalently as the
maximum number of linearly independent rows. If m ≥ n (m ≤ n) then we say
that A has full rank if rk(A) = n (rk(A) = m).

Rules: Let A be an (m × n) matrix.

(1) rk(A) ≤ min(m, n).

(2) rk(A) = rk(A').

(3) rk(AA') = rk(A'A) = rk(A).

(4) If B is a nonsingular (n × n) matrix, then rk(AB) = rk(A).

(5) If rk(A) = m, then A^+ = A'(AA')^{-1}.

(6) If rk(A) = n, then A^+ = (A'A)^{-1}A'.

(7) If B is an (n × r) matrix, rk(AB) ≤ min{rk(A), rk(B)}.

(8) If A is (m × m), then rk(A) = m if and only if |A| ≠ 0.


A.6 Eigenvalues and -vectors: Characteristic Values and Vectors


The eigenvalues or characteristic values or characteristic roots of an (m × m)
square matrix A are the roots of the polynomial in λ given by det(A − λI_m)
or det(λI_m − A). The determinant is sometimes called the characteristic
determinant and the polynomial is called the characteristic polynomial of A.
Because the roots of a polynomial are complex numbers, the eigenvalues are
also complex in general. A number λ_i is an eigenvalue of A, if the columns
of (A − λ_i I_m) are linearly dependent. Consequently, there exists an (m × 1)
vector v_i ≠ 0 such that

\[ (A - \lambda_i I_m) v_i = 0 \quad \text{or} \quad A v_i = \lambda_i v_i. \]

A vector with this property is an eigenvector or characteristic vector of A
associated with the eigenvalue λ_i. Of course, any nonzero scalar multiple of
v_i is also an eigenvector of A associated with λ_i.

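As an example consider the matrix

\[ A = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}. \]

Its eigenvalues are the roots of

\[ |A - \lambda I_2| = \det \begin{bmatrix} 1 - \lambda & 0 \\ 1 & 3 - \lambda \end{bmatrix} = (1 - \lambda)(3 - \lambda). \]

Hence, λ_1 = 1 and λ_2 = 3 are the eigenvalues of A. Associated eigenvectors
are obtained by solving

\[ \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} v_{11} \\ v_{21} \end{bmatrix} = \begin{bmatrix} v_{11} \\ v_{21} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} v_{12} \\ v_{22} \end{bmatrix} = 3 \begin{bmatrix} v_{12} \\ v_{22} \end{bmatrix}. \]

Thus,

\[ \begin{bmatrix} v_{11} \\ v_{21} \end{bmatrix} = \begin{bmatrix} 1 \\ -\tfrac{1}{2} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} v_{12} \\ v_{22} \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \]

are eigenvectors of A associated with λ_1 and λ_2, respectively.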
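The defining relation A v_i = λ_i v_i is easy to verify for this example. The Python sketch below is illustrative only; the helper `matvec` and the variable names are assumptions made for this example, and exact fractions avoid rounding issues with the element −1/2.

```python
from fractions import Fraction


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


A = [[1, 0],
     [1, 3]]

# (eigenvalue, eigenvector) pairs as derived in the text.
pairs = [
    (1, [Fraction(1), Fraction(-1, 2)]),
    (3, [Fraction(0), Fraction(1)]),
]

# A v must equal lambda * v for each pair.
checks = [matvec(A, v) == [lam * x for x in v] for lam, v in pairs]

# Any nonzero scalar multiple of an eigenvector is again an eigenvector.
scaled = [-4 * x for x in pairs[0][1]]
scaled_ok = matvec(A, scaled) == [1 * x for x in scaled]
print(checks, scaled_ok)
```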

In the following rules, the modulus of a complex number z = z_1 + i z_2 is
used. Here z_1 and z_2 are the real and imaginary parts of z, respectively, and
i = √(−1). The modulus |z| of z is defined as

\[ |z| := \sqrt{z_1^2 + z_2^2}. \]

If z_2 = 0 so that z is a real number, the modulus is just the absolute value of
z, which justifies the notation.

Rules: 

(1)  If  A  is  symmetric,  then  all  its  eigenvalues  are  real  numbers. 

(2)  The  eigenvalues  of  a  diagonal  matrix  are  its  diagonal  elements. 

(3)  The  eigenvalues  of  a  triangular  matrix  are  its  diagonal  elements. 

(4) An (m × m) matrix has at most m eigenvalues.

(5) Let λ_1, …, λ_m be the eigenvalues of the (m × m) matrix A, then |A| =
λ_1 ··· λ_m, that is, the determinant is the product of the eigenvalues.

(6) Let λ_i and λ_j be distinct eigenvalues of A with associated eigenvectors v_i
and v_j. Then v_i and v_j are linearly independent.

(7) All eigenvalues of the (m × m) matrix A have modulus less than 1 if and
only if det(I_m − Az) ≠ 0 for |z| ≤ 1, that is, the polynomial det(I_m − Az)
has no roots in and on the complex unit circle.

A.7 The Trace

The trace of an (m × m) square matrix A = (a_ij) is the sum of its diagonal
elements,

\[ \mathrm{tr}\, A = \mathrm{tr}(A) := a_{11} + \cdots + a_{mm}. \]

For  example, 


Rules: A and B are (m × m) matrices and λ_1, …, λ_m are the eigenvalues of A.

(1) tr(A + B) = tr(A) + tr(B).

(2) tr A = tr A'.

(3) If C is (m × n) and D is (n × m), tr(CD) = tr(DC).

(4) tr A = λ_1 + ··· + λ_m.


A.8 Some Special Matrices and Vectors


A.8.1 Idempotent and Nilpotent Matrices

An (m × m) matrix A is idempotent if AA = A² = A. Examples of idempotent
matrices are A = I_m, A = 0, and

An (m × m) matrix A is nilpotent if there exists a positive integer i such that
A^i = 0. For instance, the (2 × 2) matrices


are  nilpotent  because  A2  =  B2  =  0. 

Rules: In the following rules, A is an (m × m) matrix.

(1) If A is a diagonal matrix, it is idempotent if and only if all the diagonal
elements are either zero or one.

(2) If A is symmetric and idempotent, rk(A) = tr(A).

(3) If A is idempotent and rk(A) = m, then A = I_m.

(4) If A is idempotent, then I_m − A is idempotent.

(5) If A is symmetric and idempotent, then A^+ = A.

(6) If B is an (m × n) matrix, then BB^+ and B^+B are idempotent.

(7) If A is idempotent, then all its eigenvalues are zero or one.

(8) If A is nilpotent, then all its eigenvalues are zero.


A.8.2 Orthogonal Matrices and Vectors and Orthogonal Complements

Two (m × 1) vectors x and y are orthogonal if x'y = 0. They are orthonormal
if they are orthogonal and have unit length, where the length of a vector x is
||x|| := √(x'x).

An (m × k) matrix B is orthogonal to the (m × n) matrix A if A'B = 0.
If A is an (m × n) matrix of full column rank, an orthogonal complement
of A, denoted by A_⊥, is an (m × (m − n)) matrix of full column rank such
that A'A_⊥ = 0. The orthogonal complement of a nonsingular square matrix
is zero and the orthogonal complement of a zero matrix is an identity matrix
of suitable dimension.

An (m × m) square matrix A is orthogonal if its transpose is its inverse,
A'A = AA' = I_m. In other words, A is orthogonal if its rows and columns are
orthonormal vectors.

Examples  of  orthogonal  vectors  are 
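The following matrices are orthogonal matrices:

\[ \begin{bmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{bmatrix}, \quad
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} \end{bmatrix}. \]

Suppose

\[ A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 2 \end{bmatrix}, \]

then

\[ \begin{bmatrix} 1 \\ -1 \\ \tfrac{1}{2} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} \]

are orthogonal complements of A.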


Rules:

(1) I_m is an orthogonal matrix.

(2) If A is an orthogonal matrix, then det(A) = 1 or −1.

(3) If A and B are orthogonal and conformable matrices, then AB is
orthogonal.

(4) If λ_i, λ_j are distinct eigenvalues of a symmetric matrix A, then the
corresponding eigenvectors v_i and v_j are orthogonal.

(5) For an (m × n) matrix A of full column rank and n < m, the matrix
[A : A_⊥] is invertible.


A.8.3 Definite Matrices and Quadratic Forms

Let A be a symmetric (m × m) matrix and x an (m × 1) vector. The function
x'Ax is called a quadratic form in x. The symmetric matrix A or the
corresponding quadratic form is

(i) positive definite if x'Ax > 0 for all m-vectors x ≠ 0;

(ii) positive semidefinite if x'Ax ≥ 0 for all m-vectors x;

(iii) negative definite if x'Ax < 0 for all m-vectors x ≠ 0;

(iv) negative semidefinite if x'Ax ≤ 0 for all m-vectors x;

(v) indefinite if x'Ax > 0 for some x and x'Ax < 0 for another x.

Rules:  In  the  following  rules,  A  is  a  symmetric  (m  x  m)  matrix. 
(1) A = (a_ij) is positive definite if and only if all its principal minors are
positive, where

\[ \det \begin{bmatrix} a_{11} & \cdots & a_{1i} \\ \vdots & & \vdots \\ a_{i1} & \cdots & a_{ii} \end{bmatrix} \]

is the i-th principal minor of A.

(2) A is negative definite (semidefinite) if and only if −A is positive definite
(semidefinite).

(3) If A is positive or negative definite, it is nonsingular.

(4) All eigenvalues of a positive (negative) definite matrix are greater (smaller)
than zero.

(5) A diagonal matrix is positive (negative) definite if and only if all its
diagonal elements are positive (negative).

(6) If A is positive definite and B an (m × n) matrix, then B'AB is positive
semidefinite.

(7) If A is positive definite and B an (m × n) matrix with rk(B) = n, then
B'AB is positive definite.

(8) If A is positive definite, then A^{-1} is positive definite.

(9)  If  A  is  idempotent,  then  it  is  positive  semidefinite. 

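With these rules it is easy to check that

\[ \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 3 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 4 \end{bmatrix} \]

are positive definite matrices and

\[ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \]

are positive semidefinite matrices.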
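Rule (1) suggests a simple mechanical test for positive definiteness of a small symmetric matrix. The Python sketch below is illustrative only; the function names are assumptions made for this example. It tests positive definiteness only: a zero minor, as for the last two matrices above, merely shows that the matrix is not positive definite.

```python
def minor(a, i, j):
    """Matrix left after deleting row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def principal_minors(a):
    """Determinants of the upper left (i x i) blocks, i = 1, ..., m."""
    return [det([row[:i] for row in a[:i]]) for i in range(1, len(a) + 1)]


def is_positive_definite(a):
    """Rule (1): a symmetric matrix is positive definite iff all minors are > 0."""
    return all(d > 0 for d in principal_minors(a))


M1 = [[2, 1], [1, 1]]
M2 = [[3, 1, 0], [1, 1, 0], [0, 0, 4]]
M3 = [[1, 1], [1, 1]]
M4 = [[1, 0], [0, 0]]

for mat in (M1, M2, M3, M4):
    print(principal_minors(mat), is_positive_definite(mat))
```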


A.9 Decomposition and Diagonalization of Matrices


A.9.1 The Jordan Canonical Form


Let A be an (m × m) matrix with eigenvalues λ_1, …, λ_n. Then there exists a
nonsingular matrix P such that

\[ P^{-1} A P = \begin{bmatrix} \Lambda_1 & & 0 \\ & \ddots & \\ 0 & & \Lambda_n \end{bmatrix} =: \Lambda \quad \text{or} \quad A = P \Lambda P^{-1}, \tag{A.9.1} \]

where

\[ \Lambda_i = \begin{bmatrix} \lambda_i & 1 & 0 & \cdots & 0 \\ 0 & \lambda_i & 1 & & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & & \lambda_i & 1 \\ 0 & 0 & \cdots & 0 & \lambda_i \end{bmatrix}. \]

This decomposition of A is the Jordan canonical form. Because the eigenvalues
of A may be complex numbers, Λ and P may be complex matrices. If multiple
roots of the characteristic polynomial exist, they may have to appear more
than once in the list λ_1, …, λ_n.

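The Jordan canonical form has some important implications. For instance,
it implies that

\[ A^j = (P \Lambda P^{-1})^j = P \Lambda^j P^{-1} \]

and it can be shown that, for an (r_i × r_i) block Λ_i,

\[ \Lambda_i^j = \begin{bmatrix}
\lambda_i^j & \binom{j}{1} \lambda_i^{j-1} & \cdots & \binom{j}{r_i - 1} \lambda_i^{j - r_i + 1} \\
0 & \lambda_i^j & \cdots & \binom{j}{r_i - 2} \lambda_i^{j - r_i + 2} \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_i^j
\end{bmatrix}, \]

where

\[ \binom{p}{q} = \frac{p!}{(p - q)!\, q!} \]

denotes a binomial coefficient. We have the following rules.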


Rules: Suppose A is a real (m × m) matrix with eigenvalues λ_1, …, λ_n which
all have modulus less than 1, that is, |λ_i| < 1 for i = 1, …, n. Furthermore,
let Λ and P be the matrices given in (A.9.1).

(1) A^j = PΛ^jP^{-1} → 0 as j → ∞.

(2) Σ_{j=0}^∞ A^j = (I_m − A)^{-1} exists.

(3) The sequence A^j, j = 0, 1, 2, …, is absolutely summable, that is,

\[ \sum_{j=0}^{\infty} |a_{kl,j}| \]

is finite for all k, l = 1, …, m, where a_{kl,j} is a typical element of A^j. (See
Section C.3 regarding the concept of absolute summability.)
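Rules (1) and (2) can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: the (2 × 2) matrix is a hypothetical example with eigenvalues 0.7 and 0.3, the number of terms (200) is an arbitrary cutoff, and the inverse of I_2 − A is written out in closed form for the (2 × 2) case.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical example; eigenvalues solve l^2 - l + 0.21 = 0, i.e. 0.7 and 0.3.
A = [[0.5, 0.1],
     [0.4, 0.5]]

power = [[1.0, 0.0], [0.0, 1.0]]        # A^0
partial_sum = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    partial_sum = matadd(partial_sum, power)   # adds A^j
    power = matmul(power, A)                   # becomes A^(j+1)

# Closed-form inverse of (I_2 - A) for comparison.
det_ia = (1 - A[0][0]) * (1 - A[1][1]) - A[0][1] * A[1][0]
inverse_ia = [[(1 - A[1][1]) / det_ia, A[0][1] / det_ia],
              [A[1][0] / det_ia, (1 - A[0][0]) / det_ia]]

max_gap = max(abs(partial_sum[i][j] - inverse_ia[i][j])
              for i in range(2) for j in range(2))
max_power = max(abs(x) for row in power for x in row)
print(max_gap, max_power)
```

The partial sums agree with (I_2 − A)^{-1} up to rounding error, and the elements of A^200 are numerically zero.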
A.9.2 Decomposition of Symmetric Matrices

If A is a symmetric (m × m) matrix, then there exists an orthogonal matrix
P such that

\[ P'AP = \Lambda = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_m \end{bmatrix} \quad \text{or} \quad A = P \Lambda P', \tag{A.9.2} \]

where the λ_i's are the eigenvalues of A and the columns of P are the
corresponding eigenvectors. Here all matrices are real again because the eigenvalues
of a symmetric matrix are real numbers. Denoting the i-th column of P by p_i
and using that p_i'p_j = 0 for i ≠ j, we get

\[ A = P \Lambda P' = \sum_{i=1}^{m} \lambda_i p_i p_i'. \tag{A.9.3} \]

Moreover,

\[ A^2 = P \Lambda P' P \Lambda P' = P \Lambda^2 P' \]

and, more generally,

\[ A^k = P \Lambda^k P'. \]

If A is a positive definite symmetric (m × m) matrix, then all eigenvalues are
positive so that the notation

\[ \Lambda^{1/2} := \begin{bmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_m} \end{bmatrix} \]

makes sense. Defining Q = PΛ^{1/2}P', we get QQ = A. In generalization of the
terminology for positive real numbers, Q may be called a square root of A and
may be denoted by A^{1/2}.


A.9.3  The  Choleski  Decomposition  of  a  Positive  Definite  Matrix 

If A is a positive definite (m × m) matrix, then there exists a lower (upper)
triangular matrix P with positive main diagonal such that

\[ P^{-1} A P'^{-1} = I_m \quad \text{or} \quad A = PP'. \tag{A.9.4} \]

Similarly, if A is positive semidefinite with rk(A) = n < m, then there exists
a nonsingular matrix P such that

\[ P^{-1} A P'^{-1} = \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}. \]

Alternatively, A = QQ', where

\[ Q = P \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}. \tag{A.9.5} \]


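For instance,

\[ \begin{bmatrix} 26 & 3 & 0 \\ 3 & 9 & 0 \\ 0 & 0 & 81 \end{bmatrix}
= \begin{bmatrix} 5 & 1 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 9 \end{bmatrix}
\begin{bmatrix} 5 & 0 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 9 \end{bmatrix}
= \begin{bmatrix} \sqrt{26} & 0 & 0 \\ 3/\sqrt{26} & 15/\sqrt{26} & 0 \\ 0 & 0 & 9 \end{bmatrix}
\begin{bmatrix} \sqrt{26} & 3/\sqrt{26} & 0 \\ 0 & 15/\sqrt{26} & 0 \\ 0 & 0 & 9 \end{bmatrix}. \]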

The decomposition A = PP', where P is lower triangular with positive main
diagonal, is sometimes called Choleski decomposition. Computer programs are
available to determine the matrix P for a given positive definite matrix A. If
a lower triangular matrix P is supplied by the program, an upper triangular
matrix Q can be obtained as follows: Define an (m × m) matrix

\[ G = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & & \vdots \\ 1 & \cdots & 0 & 0 \end{bmatrix} \]

with ones on the diagonal from the upper right-hand corner to the lower
left-hand corner and zeros elsewhere. Note that G' = G and G^{-1} = G. Suppose a
decomposition of the (m × m) matrix A is desired. Then decompose B = GAG
as B = PP', where P is lower triangular. Hence,

\[ A = GBG = GPGGP'G = QQ', \]

where Q = GPG is upper triangular.
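The construction of an upper triangular factor from a lower triangular Choleski routine can be sketched as follows. The Python code is illustrative only; the function names are assumptions made for this example, `cholesky_lower` is a textbook implementation without any safeguards for matrices that are not positive definite, and pre- and postmultiplication by G is implemented by simply reversing the order of rows and columns.

```python
import math


def cholesky_lower(a):
    """Lower triangular P with positive diagonal such that a = P P'."""
    m = len(a)
    p = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(p[i][k] * p[j][k] for k in range(j))
            if i == j:
                p[i][j] = math.sqrt(a[i][i] - s)
            else:
                p[i][j] = (a[i][j] - s) / p[j][j]
    return p


def flip(a):
    """Return G a G, i.e. reverse the order of both rows and columns."""
    return [row[::-1] for row in a[::-1]]


def cholesky_upper(a):
    """Upper triangular Q = G P G with a = Q Q', where G a G = P P'."""
    return flip(cholesky_lower(flip(a)))


A = [[26, 3, 0],
     [3, 9, 0],
     [0, 0, 81]]

P = cholesky_lower(A)   # lower triangular factor
Q = cholesky_upper(A)   # upper triangular factor
print(P)
print(Q)
```

For the matrix of the numerical example above, `Q` reproduces the upper triangular factor with rows (5, 1, 0), (0, 3, 0), (0, 0, 9), while `P` is the lower triangular factor involving √26.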


A.10 Partitioned Matrices


Let the (m × n) matrix A be partitioned into submatrices A_11, A_12, A_21, A_22
with dimensions (p × q), (p × (n − q)), ((m − p) × q), and ((m − p) × (n − q)),
respectively, so that

\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}. \tag{A.10.1} \]


For  such  a  partitioned  matrix,  a  number  of  useful  results  hold. 
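Rules:

(1)
\[ A' = \begin{bmatrix} A_{11}' & A_{21}' \\ A_{12}' & A_{22}' \end{bmatrix}. \]

(2) If n = m and q = p and A, A_11, and A_22 are nonsingular, then

\[ A^{-1} = \begin{bmatrix} D & -D A_{12} A_{22}^{-1} \\ -A_{22}^{-1} A_{21} D & A_{22}^{-1} + A_{22}^{-1} A_{21} D A_{12} A_{22}^{-1} \end{bmatrix}
= \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} G A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} G \\ -G A_{21} A_{11}^{-1} & G \end{bmatrix}, \]

where D := (A_11 − A_12 A_22^{-1} A_21)^{-1} and G := (A_22 − A_21 A_11^{-1} A_12)^{-1}.

(3) Under the conditions of (2),

\[ (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} = A_{11}^{-1} + A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1}. \]

(4) Under the conditions of (2), if A_12 and A_21 are null matrices,

\[ A^{-1} = \begin{bmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix}. \]

(5) If A is a square matrix (n = m) and A_11 is square and nonsingular, then
|A| = |A_11| · |A_22 − A_21 A_11^{-1} A_12|.

(6) If A is a square matrix and A_22 is square and nonsingular, then |A| =
|A_22| · |A_11 − A_12 A_22^{-1} A_21|.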


A.11 The Kronecker Product

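Let A = (a_ij) and B = (b_ij) be (m × n) and (p × q) matrices, respectively.
The (mp × nq) matrix

\[ A \otimes B := \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix} \tag{A.11.1} \]

is the Kronecker product or direct product of A and B. For example, the
Kronecker product of

\[ A = \begin{bmatrix} 3 & 4 & -1 \\ 2 & 0 & 0 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 5 & -1 \\ 3 & 3 \end{bmatrix} \tag{A.11.2} \]

is

\[ A \otimes B = \begin{bmatrix} 15 & -3 & 20 & -4 & -5 & 1 \\ 9 & 9 & 12 & 12 & -3 & -3 \\ 10 & -2 & 0 & 0 & 0 & 0 \\ 6 & 6 & 0 & 0 & 0 & 0 \end{bmatrix} \]

and

\[ B \otimes A = \begin{bmatrix} 15 & 20 & -5 & -3 & -4 & 1 \\ 10 & 0 & 0 & -2 & 0 & 0 \\ 9 & 12 & -3 & 9 & 12 & -3 \\ 6 & 0 & 0 & 6 & 0 & 0 \end{bmatrix}. \]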

Rules: In the following rules, suitable dimensions are assumed.

(1) A ⊗ B ≠ B ⊗ A in general.

(2) (A ⊗ B)' = A' ⊗ B'.

(3) A ⊗ (B + C) = A ⊗ B + A ⊗ C.

(4) (A ⊗ B)(C ⊗ D) = AC ⊗ BD.

(5) If A and B are invertible, then (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}.

(6) If A and B are square matrices with eigenvalues λ_A, λ_B, respectively, and
associated eigenvectors v_A, v_B, then λ_A λ_B is an eigenvalue of A ⊗ B with
eigenvector v_A ⊗ v_B.

(7) If A and B are (m × m) and (n × n) square matrices, respectively, then
|A ⊗ B| = |A|^n |B|^m.

(8) If A and B are square matrices,

tr(A ⊗ B) = tr(A)tr(B).

(9) (A ⊗ B)^+ = A^+ ⊗ B^+.
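The block structure in (A.11.1) can be sketched in a few lines. The Python code below is illustrative only; the function name `kron` is an assumption made for this example. It reproduces both products for the matrices in (A.11.2) and checks Rules (1) and (8) on this example.

```python
def kron(a, b):
    """Kronecker product: every element a_ij is replaced by the block a_ij * b."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


A = [[3, 4, -1],
     [2, 0, 0]]
B = [[5, -1],
     [3, 3]]

AB = kron(A, B)   # (4 x 6)
BA = kron(B, A)   # (4 x 6), differs from AB in general
BB = kron(B, B)   # square, so Rule (8) applies

for row in AB:
    print(row)
print(trace(BB), trace(B) ** 2)
```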


A.12 The vec and vech Operators and Related Matrices


A.12.1 The Operators

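Let A = (a_1, …, a_n) be an (m × n) matrix with (m × 1) columns a_i. The vec
operator transforms A into an (mn × 1) vector by stacking the columns, that
is,

\[ \mathrm{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}. \]

For instance, if A and B are as in (A.11.2), then

\[ \mathrm{vec}(A) = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 0 \\ -1 \\ 0 \end{bmatrix} \quad \text{and} \quad \mathrm{vec}(B) = \begin{bmatrix} 5 \\ 3 \\ -1 \\ 3 \end{bmatrix}. \]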

Rules:  Let  A,  B ,  C  be  matrices  with  appropriate  dimensions. 
(1) vec(A + B) = vec(A) + vec(B).

(2) vec(ABC) = (C' ⊗ A)vec(B).

(3) vec(AB) = (I ⊗ A)vec(B) = (B' ⊗ I)vec(A).

(4) vec(ABC) = (I ⊗ AB)vec(C) = (C'B' ⊗ I)vec(A).

(5) vec(B')'vec(A) = tr(BA) = tr(AB) = vec(A')'vec(B).

(6) tr(ABC) = vec(A')'(C' ⊗ I)vec(B)
            = vec(A')'(I ⊗ B)vec(C)
            = vec(B')'(A' ⊗ I)vec(C)
            = vec(B')'(I ⊗ C)vec(A)
            = vec(C')'(B' ⊗ I)vec(A)
            = vec(C')'(I ⊗ A)vec(B).
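Rules (2) and (5) lend themselves to a quick numerical check. The Python sketch below is illustrative only; the helper names and the (3 × 2) matrix `B` are assumptions chosen for this example so that the product ABC is defined, while `A` and `C` are taken from (A.11.2).

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def kron(a, b):
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def vec(a):
    """Stack the columns of a into one list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]


A = [[3, 4, -1], [2, 0, 0]]        # (2 x 3)
B = [[1, 2], [0, 1], [4, -1]]      # (3 x 2), hypothetical
C = [[5, -1], [3, 3]]              # (2 x 2)

# Rule (2): vec(ABC) = (C' kron A) vec(B)
lhs = vec(matmul(matmul(A, B), C))
rhs = matvec(kron(transpose(C), A), vec(B))

# Rule (5): tr(AB) = vec(A')' vec(B)
AB = matmul(A, B)
trace_ab = sum(AB[i][i] for i in range(len(AB)))
inner = sum(x * y for x, y in zip(vec(transpose(A)), vec(B)))
print(lhs, rhs, trace_ab, inner)
```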

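The vech operator is closely related to vec. It only stacks the elements on
and below the main diagonal of a square matrix. For instance,

\[ \mathrm{vech} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{22} \\ a_{32} \\ a_{33} \end{bmatrix}. \]

In general, if A is an (m × m) matrix, vech(A) is an m(m + 1)/2-dimensional
vector. The vech operator is usually applied to symmetric matrices to collect
the separate elements only.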


A.12.2 Elimination, Duplication, and Commutation Matrices

The vec and vech operators are related by the elimination matrix, L_m, and
the duplication matrix, D_m. The former is an (½m(m + 1) × m²) matrix such
that, for an (m × m) square matrix A,

\[ \mathrm{vech}(A) = L_m \mathrm{vec}(A). \tag{A.12.1} \]

Thus, e.g., for m = 3,

\[ L_3 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}. \]

The duplication matrix D_m is (m² × ½m(m + 1)) and is defined so that, for
any symmetric (m × m) matrix A,

\[ \mathrm{vec}(A) = D_m \mathrm{vech}(A). \tag{A.12.2} \]

For instance, for m = 3,

\[ D_3 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}. \]

Because the rank of D_m is easily seen to be m(m + 1)/2, the matrix D_m'D_m
is invertible. Thus, left-multiplication of (A.12.2) by (D_m'D_m)^{-1}D_m' gives

\[ (D_m' D_m)^{-1} D_m' \mathrm{vec}(A) = \mathrm{vech}(A). \tag{A.12.3} \]

Note, however, that (D_m'D_m)^{-1}D_m' ≠ L_m in general because (A.12.3) holds
for symmetric matrices A only while (A.12.1) holds for arbitrary square
matrices A.

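The commutation matrix, K_mn, is another matrix that is occasionally
useful in dealing with the vec operator. K_mn is an (mn × mn) matrix defined
such that, for any (m × n) matrix A,

\[ \mathrm{vec}(A') = K_{mn} \mathrm{vec}(A) \]

or, equivalently,

\[ \mathrm{vec}(A) = K_{nm} \mathrm{vec}(A'). \]

For example,

\[ K_{32} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \]

because for

\[ A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}, \]

\[ \mathrm{vec}(A') = \begin{bmatrix} a_{11} \\ a_{12} \\ a_{21} \\ a_{22} \\ a_{31} \\ a_{32} \end{bmatrix}
= K_{32} \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{12} \\ a_{22} \\ a_{32} \end{bmatrix}
= K_{32} \mathrm{vec}(A). \]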
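The three matrices can be generated directly from their defining relations (A.12.1), (A.12.2), and vec(A') = K_mn vec(A). The Python sketch below is illustrative only; all function names and the test matrices `S`, `M`, and `A` are assumptions made for this example.

```python
def vec(a):
    """Stack the columns of a into one list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]


def vech(a):
    """Stack the elements on and below the main diagonal, column by column."""
    m = len(a)
    return [a[i][j] for j in range(m) for i in range(j, m)]


def matvec(mat, v):
    return [sum(x * y for x, y in zip(row, v)) for row in mat]


def elimination_matrix(m):
    """L_m: picks the elements a_ij with i >= j out of vec(A)."""
    rows = []
    for j in range(m):
        for i in range(j, m):
            row = [0] * (m * m)
            row[j * m + i] = 1          # position of a_ij in vec(A)
            rows.append(row)
    return rows


def duplication_matrix(m):
    """D_m: maps vech(A) to vec(A) for symmetric A."""
    pos, k = {}, 0
    for j in range(m):
        for i in range(j, m):
            pos[(i, j)] = k             # position of a_ij (i >= j) in vech(A)
            k += 1
    rows = []
    for j in range(m):
        for i in range(m):
            row = [0] * k
            row[pos[(max(i, j), min(i, j))]] = 1
            rows.append(row)
    return rows


def commutation_matrix(m, n):
    """K_mn: maps vec(A) to vec(A') for an (m x n) matrix A."""
    k = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            # a_ij sits at j*m + i in vec(A) and at i*n + j in vec(A')
            k[i * n + j][j * m + i] = 1
    return k


S = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]   # symmetric
M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # arbitrary square matrix
A = [[1, 2], [3, 4], [5, 6]]            # (3 x 2)
A_t = [list(col) for col in zip(*A)]

L3 = elimination_matrix(3)
D3 = duplication_matrix(3)
K32 = commutation_matrix(3, 2)
print(matvec(L3, vec(M)), matvec(D3, vech(S)), matvec(K32, vec(A)))
```

The generated `L3`, `D3`, and `K32` coincide with the matrices displayed in this subsection.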
Rules:

(7) L_m D_m = I_{m(m+1)/2}.

(8) K_mm D_m = D_m.

(9) K_m1 = K_1m = I_m.

(10) K_mn' = K_mn^{-1} = K_nm.

(11) tr K_mm = m.

(12) det(K_mn) = (−1)^{mn(m−1)(n−1)/4}.

(13) tr(D_m'D_m) = m², tr(D_m'D_m)^{-1} = m(m + 3)/4.

(14) det(D_m'D_m) = 2^{m(m−1)/2}.

(15) tr(D_m D_m') = m².

(16) |D_m'(A ⊗ A)D_m| = 2^{m(m−1)/2} |A|^{m+1}, where A is an (m × m) matrix.

(17) (D_m'(A ⊗ A)D_m)^{-1} = (D_m'D_m)^{-1}D_m'(A^{-1} ⊗ A^{-1})D_m(D_m'D_m)^{-1}, if A
is a nonsingular (m × m) matrix.

(18) L_m L_m' = I_{m(m+1)/2}.

(19) L_m L_m' and L_m K_mm L_m' are idempotent.

Let A and B be lower triangular (m × m) matrices. Then we have the following
rules:

(20) L_m(A ⊗ B)L_m' is lower triangular.

(21) L_m'L_m(A' ⊗ B)L_m' = (A' ⊗ B)L_m'.

(22) [L_m(A' ⊗ B)L_m']^s = L_m((A')^s ⊗ B^s)L_m' for s = 0, 1, … and for s =
…, −2, −1, if A^{-1} and B^{-1} exist.

Let G be (m × n), F (p × q), and b (p × 1). Then the following results hold:

(23) K_pm(G ⊗ F) = (F ⊗ G)K_qn.

(24) K_pm(G ⊗ F)K_nq = F ⊗ G.

(25) K_pm(G ⊗ b) = b ⊗ G.

(26) K_mp(b ⊗ G) = G ⊗ b.

(27) vec(G ⊗ F) = (I_n ⊗ K_qm ⊗ I_p)(vec(G) ⊗ vec(F)).

(28) (D_m'D_m)^{-1}D_m'K_mm = (D_m'D_m)^{-1}D_m'.
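A.13 Vector and Matrix Differentiation

In the following, it will be assumed that all derivatives exist and are
continuous. Let f(β) be a scalar function that depends on the (n × 1) vector
β = (β_1, …, β_n)'. Then

\[ \frac{\partial f}{\partial \beta} := \begin{bmatrix} \dfrac{\partial f}{\partial \beta_1} \\ \vdots \\ \dfrac{\partial f}{\partial \beta_n} \end{bmatrix}, \qquad
\frac{\partial f}{\partial \beta'} := \left[ \frac{\partial f}{\partial \beta_1}, \dots, \frac{\partial f}{\partial \beta_n} \right] \]

are (n × 1) and (1 × n) vectors of first order partial derivatives, respectively,
and

\[ \frac{\partial^2 f}{\partial \beta \, \partial \beta'} := \left[ \frac{\partial^2 f}{\partial \beta_i \, \partial \beta_j} \right]
= \begin{bmatrix} \dfrac{\partial^2 f}{\partial \beta_1 \partial \beta_1} & \cdots & \dfrac{\partial^2 f}{\partial \beta_1 \partial \beta_n} \\ \vdots & & \vdots \\ \dfrac{\partial^2 f}{\partial \beta_n \partial \beta_1} & \cdots & \dfrac{\partial^2 f}{\partial \beta_n \partial \beta_n} \end{bmatrix} \]

is the (n × n) Hessian matrix of second order partial derivatives. If f(A) is a
scalar function of an (m × n) matrix A = (a_ij), then

\[ \frac{\partial f}{\partial A} := \left[ \frac{\partial f}{\partial a_{ij}} \right] \]

is an (m × n) matrix of partial derivatives. If the (m × n) matrix A = (a_ij)
depends on the scalar β, then

\[ \frac{\partial A}{\partial \beta} := \left[ \frac{\partial a_{ij}}{\partial \beta} \right] \]

is an (m × n) matrix. If y(β) = (y_1(β), …, y_m(β))' is an (m × 1) vector that
depends on the (n × 1) vector β, then

\[ \frac{\partial y}{\partial \beta'} := \begin{bmatrix} \dfrac{\partial y_1}{\partial \beta_1} & \cdots & \dfrac{\partial y_1}{\partial \beta_n} \\ \vdots & & \vdots \\ \dfrac{\partial y_m}{\partial \beta_1} & \cdots & \dfrac{\partial y_m}{\partial \beta_n} \end{bmatrix} \]

is an (m × n) matrix and

\[ \frac{\partial y'}{\partial \beta} := \left( \frac{\partial y}{\partial \beta'} \right)'. \]

For example, if β = (β_1, β_2)' and f(β) = β_1² − 2β_1β_2, then

\[ \frac{\partial f}{\partial \beta} = \begin{bmatrix} \dfrac{\partial f}{\partial \beta_1} \\ \dfrac{\partial f}{\partial \beta_2} \end{bmatrix} = \begin{bmatrix} 2\beta_1 - 2\beta_2 \\ -2\beta_1 \end{bmatrix}. \]

If

\[ y(\beta) = \begin{bmatrix} \beta_1 + \beta_2 \\ e^{\beta_1} \end{bmatrix}, \]

then

\[ \frac{\partial y}{\partial \beta'} = \begin{bmatrix} 1 & 1 \\ e^{\beta_1} & 0 \end{bmatrix}. \]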
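The two example derivatives can be cross-checked against finite differences. The Python sketch below is illustrative only; the function names, the evaluation point `beta`, and the step size `h` are assumptions made for this example, and central differences give approximations rather than exact derivatives.

```python
import math


def f(b):
    """Scalar example function f(beta) = beta_1^2 - 2 beta_1 beta_2."""
    return b[0] ** 2 - 2 * b[0] * b[1]


def grad_f(b):
    """Analytical gradient of f."""
    return [2 * b[0] - 2 * b[1], -2 * b[0]]


def y(b):
    """Vector example function y(beta) = (beta_1 + beta_2, exp(beta_1))'."""
    return [b[0] + b[1], math.exp(b[0])]


def jac_y(b):
    """Analytical matrix of first order partial derivatives of y."""
    return [[1.0, 1.0], [math.exp(b[0]), 0.0]]


def numerical_jacobian(func, b, h=1e-6):
    """Central differences; rows index outputs, columns index elements of b."""
    cols = []
    for k in range(len(b)):
        up, dn = list(b), list(b)
        up[k] += h
        dn[k] -= h
        cols.append([(u - d) / (2 * h) for u, d in zip(func(up), func(dn))])
    return [list(row) for row in zip(*cols)]


beta = [0.5, -1.5]
num_grad = numerical_jacobian(lambda b: [f(b)], beta)[0]
num_jac = numerical_jacobian(y, beta)
print(grad_f(beta), num_grad)
print(jac_y(beta), num_jac)
```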


The  following  two  propositions  are  useful  for  deriving  rules  for  vector  and 
matrix  differentiation. 
Proposition A.1 (Chain Rule for Vector Differentiation)

Let α and β be (m × 1) and (n × 1) vectors, respectively, and suppose h(α) is
(p × 1) and g(β) is (m × 1). Then, with α = g(β),

\[ \frac{\partial h(g(\beta))}{\partial \beta'} = \frac{\partial h(\alpha)}{\partial \alpha'} \, \frac{\partial g(\beta)}{\partial \beta'}, \]

which is a (p × n) matrix.


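Proposition A.2 (Product Rules for Vector Differentiation)

(a) Suppose β is (m × 1), a(β) = (a_1(β), …, a_n(β))' is (n × 1), c(β) =
(c_1(β), …, c_p(β))' is (p × 1) and A = (a_ij) is (n × p) and does not depend on
β. Then

\[ \frac{\partial [a(\beta)' A c(\beta)]}{\partial \beta'} = c(\beta)' A' \frac{\partial a(\beta)}{\partial \beta'} + a(\beta)' A \frac{\partial c(\beta)}{\partial \beta'}. \]

(b) If β is a (1 × 1) scalar, A(β) is (m × n) and B(β) is (n × p), then

\[ \frac{\partial AB}{\partial \beta} = \frac{\partial A}{\partial \beta} B + A \frac{\partial B}{\partial \beta}. \]

(c) If β is an (m × 1) vector, A(β) is (n × p) and B(β) is (p × q), then

\[ \frac{\partial \mathrm{vec}(AB)}{\partial \beta'} = (I_q \otimes A) \frac{\partial \mathrm{vec}(B)}{\partial \beta'} + (B' \otimes I_n) \frac{\partial \mathrm{vec}(A)}{\partial \beta'}. \]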

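Proof:

(a)
\[ \frac{\partial (a'Ac)}{\partial \beta'} = \frac{\partial \left( \sum_{i,j} a_i a_{ij} c_j \right)}{\partial \beta'}
= \sum_{i,j} \left[ \frac{\partial a_i}{\partial \beta'} a_{ij} c_j + a_i a_{ij} \frac{\partial c_j}{\partial \beta'} \right]
= c'A' \frac{\partial a}{\partial \beta'} + a'A \frac{\partial c}{\partial \beta'}. \]

(b)
\[ AB = \left[ \sum_j a_{ij} b_{jk} \right] \]

and

\[ \frac{\partial \left( \sum_j a_{ij} b_{jk} \right)}{\partial \beta} = \sum_j \left[ \frac{\partial a_{ij}}{\partial \beta} b_{jk} + a_{ij} \frac{\partial b_{jk}}{\partial \beta} \right]. \]

(c) Follows from (b) by stacking the columns of AB and writing the resulting
columns ∂vec(AB)/∂β_i for i = 1, …, m in one matrix. ■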

The  following  rules  are  now  easy  to  verify. 

Rules: 

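(1) Let A be an (m × n) matrix and β be an (n × 1) vector. Then

\[ \frac{\partial A\beta}{\partial \beta'} = A \quad \text{and} \quad \frac{\partial \beta'A'}{\partial \beta} = A'. \]

Proof: This result is a special case of Proposition A.2(a). ■

(2) Let A be (m × m) and β be (m × 1). Then

\[ \frac{\partial \beta'A\beta}{\partial \beta} = (A + A')\beta \quad \text{and} \quad \frac{\partial \beta'A\beta}{\partial \beta'} = \beta'(A' + A). \]

Proof: See Proposition A.2(a). ■

(3) If A is (m × m) and β is (m × 1), then

\[ \frac{\partial^2 \beta'A\beta}{\partial \beta \, \partial \beta'} = A + A'. \]

Proof: Follows from (1) and (2). ■

(4) If A is a symmetric (m × m) matrix and β an (m × 1) vector, then

\[ \frac{\partial^2 \beta'A\beta}{\partial \beta \, \partial \beta'} = 2A. \]

Proof: See (3). ■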

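(5) Let Ω be a symmetric (n × n) matrix and c(β) an (n × 1) vector that
depends on the (m × 1) vector β. Then

\[ \frac{\partial c(\beta)' \Omega c(\beta)}{\partial \beta'} = 2 c(\beta)' \Omega \frac{\partial c(\beta)}{\partial \beta'} \]

and

\[ \frac{\partial^2 c(\beta)' \Omega c(\beta)}{\partial \beta \, \partial \beta'} = 2 \left[ \frac{\partial c(\beta)'}{\partial \beta} \Omega \frac{\partial c(\beta)}{\partial \beta'} + \left( c(\beta)' \Omega \otimes I_m \right) \frac{\partial \mathrm{vec}(\partial c(\beta)' / \partial \beta)}{\partial \beta'} \right]. \]

In particular, if y is an (n × 1) vector and X an (n × m) matrix,

\[ \frac{\partial (y - X\beta)' \Omega (y - X\beta)}{\partial \beta'} = -2 (y - X\beta)' \Omega X \]

and

\[ \frac{\partial^2 (y - X\beta)' \Omega (y - X\beta)}{\partial \beta \, \partial \beta'} = 2 X' \Omega X. \]

Proof: Follows from Proposition A.2(a). ■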

(6) Suppose β is (m × 1), B(β) is (n × p), A is (k × n), and C is (p × q) and
the latter two matrices do not depend on β. Then

\[ \frac{\partial \mathrm{vec}(ABC)}{\partial \beta'} = (C' \otimes A) \frac{\partial \mathrm{vec}(B)}{\partial \beta'}. \]

Proof: Follows from Rule (2), Section A.12, and Proposition A.1. ■

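(7) Suppose β is (m × 1), A(β) is (n × p), D(β) is (q × r), and C is (p × q)
and does not depend on β. Then

\[ \frac{\partial \mathrm{vec}(ACD)}{\partial \beta'} = (I_r \otimes AC) \frac{\partial \mathrm{vec}(D)}{\partial \beta'} + (D'C' \otimes I_n) \frac{\partial \mathrm{vec}(A)}{\partial \beta'}. \]

Proof: Follows from Proposition A.2(c) by setting B = CD and noting
that ∂vec(CD)/∂β' = (I_r ⊗ C)∂vec(D)/∂β'. ■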

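(8) If $\beta$ is $(m \times 1)$ and $A(\beta)$ is $(n \times n)$, then, for any positive integer $h$,

$$\frac{\partial\operatorname{vec}(A^h)}{\partial\beta'}
= \left[\sum_{i=0}^{h-1}(A')^{h-1-i} \otimes A^i\right]\frac{\partial\operatorname{vec}(A)}{\partial\beta'}.$$

Proof: Follows inductively from Proposition A.2(c). The result is evident
for $h = 1$. Assuming it holds for $h - 1$ gives

$$\frac{\partial\operatorname{vec}(AA^{h-1})}{\partial\beta'}
= (I_n \otimes A)\left[\sum_{i=0}^{h-2}(A')^{h-2-i} \otimes A^i\right]\frac{\partial\operatorname{vec}(A)}{\partial\beta'}
+ \left((A')^{h-1} \otimes I_n\right)\frac{\partial\operatorname{vec}(A)}{\partial\beta'}.$$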


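(9) If $A$ is a nonsingular $(m \times m)$ matrix, then

$$\frac{\partial\operatorname{vec}(A^{-1})}{\partial\operatorname{vec}(A)'} = -(A^{-1})' \otimes A^{-1}.$$

Proof: Using Proposition A.2(c),

$$0 = \frac{\partial\operatorname{vec}(I_m)}{\partial\operatorname{vec}(A)'}
= \frac{\partial\operatorname{vec}(A^{-1}A)}{\partial\operatorname{vec}(A)'}
= (I_m \otimes A^{-1})\frac{\partial\operatorname{vec}(A)}{\partial\operatorname{vec}(A)'}
+ (A' \otimes I_m)\frac{\partial\operatorname{vec}(A^{-1})}{\partial\operatorname{vec}(A)'}.$$

Because $\partial\operatorname{vec}(A)/\partial\operatorname{vec}(A)' = I_{m^2}$, solving for
$\partial\operatorname{vec}(A^{-1})/\partial\operatorname{vec}(A)'$ gives the result.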
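Rule (9) lends itself to a direct numerical check. The sketch below assumes NumPy; the helpers `vec` and `jacobian_vec_inverse` and the example matrix are hypothetical illustrations. It builds the Jacobian of $\operatorname{vec}(A^{-1})$ with respect to $\operatorname{vec}(A)'$ by central differences and compares it with $-(A^{-1})' \otimes A^{-1}$.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

def jacobian_vec_inverse(A, eps=1e-6):
    """Finite-difference Jacobian of vec(A^{-1}) with respect to vec(A)'."""
    m = A.shape[0]
    J = np.zeros((m * m, m * m))
    for k in range(m * m):
        e = np.zeros(m * m)
        e[k] = eps
        E = e.reshape(m, m, order="F")    # perturb the k-th element of vec(A)
        J[:, k] = vec(np.linalg.inv(A + E) - np.linalg.inv(A - E)) / (2 * eps)
    return J

A = np.array([[4.0, 1.0, 0.5],
              [0.2, 3.0, 1.0],
              [0.3, 0.1, 2.0]])
A_inv = np.linalg.inv(A)

analytic = -np.kron(A_inv.T, A_inv)       # Rule (9)
numeric = jacobian_vec_inverse(A)
```

The column-major `vec` matters here: the Kronecker product formula relies on stacking columns, not rows.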


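(10) Let $A$ be a symmetric positive definite $(m \times m)$ matrix and let $P$ be
a lower triangular $(m \times m)$ matrix with positive elements on the main
diagonal such that $A = PP'$. Moreover, let $L_m$ be an $(\tfrac{1}{2}m(m+1) \times m^2)$
elimination matrix such that $L_m\operatorname{vec}(A) = \operatorname{vech}(A)$ consists of the elements
on and below the main diagonal of $A$ only. Then

$$\frac{\partial\operatorname{vech}(P)}{\partial\operatorname{vech}(A)'}
= \left\{L_m\left[(I_m \otimes P)K_{mm} + (P \otimes I_m)\right]L_m'\right\}^{-1}
= \left\{L_m(I_{m^2} + K_{mm})(P \otimes I_m)L_m'\right\}^{-1},$$

where $K_{mm}$ is an $(m^2 \times m^2)$ commutation matrix such that $K_{mm}\operatorname{vec}(P) = \operatorname{vec}(P')$.

Proof: See Lütkepohl (1989a). ■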

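(11) If $A = (a_{ij})$ is an $(m \times m)$ matrix, then

$$\frac{\partial\operatorname{tr}(A)}{\partial A} = I_m.$$

Proof: $\operatorname{tr}(A) = a_{11} + \cdots + a_{mm}$. Hence,

$$\frac{\partial\operatorname{tr}(A)}{\partial a_{ij}} = \begin{cases} 0 & \text{if } i \neq j, \\ 1 & \text{if } i = j. \end{cases}$$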


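(12) If $A = (a_{ij})$ is $(m \times n)$ and $B = (b_{ij})$ is $(n \times m)$, then

$$\frac{\partial\operatorname{tr}(AB)}{\partial A} = B'.$$

Proof: Follows because $\operatorname{tr}(AB) = \sum_{j=1}^{n} a_{1j}b_{j1} + \cdots + \sum_{j=1}^{n} a_{mj}b_{jm}$. ■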

(13) Suppose $A$ is an $(m \times n)$ matrix and $B$, $C$ are $(m \times m)$ and $(n \times m)$,
respectively. Then

$$\frac{\partial\operatorname{tr}(BAC)}{\partial A} = B'C'.$$


Proof:  Follows  from  Rule  (12)  because  tr (BAC)  =  tr (ACB).  ■ 

(14) Let $A, B, C, D$ be $(m \times n)$, $(n \times n)$, $(m \times n)$, and $(n \times m)$ matrices,
respectively. Then

$$\frac{\partial\operatorname{tr}(DABA'C)}{\partial A} = CDAB + D'C'AB'.$$

Proof: See Murata (1982, Appendix, Theorem 6a). ■

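(15) Let $A$, $B$, and $C$ be $(m \times m)$ matrices and suppose $A$ is nonsingular.
Then

$$\frac{\partial\operatorname{tr}(BA^{-1}C)}{\partial A} = -(A^{-1}CBA^{-1})'.$$

Proof: By Rule (6) of Section A.12,

$$\operatorname{tr}(BA^{-1}C) = \operatorname{vec}(B')'(C' \otimes I_m)\operatorname{vec}(A^{-1}).$$

Hence, using (9),

$$
\begin{aligned}
\frac{\partial\operatorname{tr}(BA^{-1}C)}{\partial\operatorname{vec}(A)'}
&= -\operatorname{vec}(B')'(C' \otimes I_m)\left((A^{-1})' \otimes A^{-1}\right) \\
&= -\left[(A^{-1}C \otimes A^{-1\prime})\operatorname{vec}(B')\right]' \\
&= -\left[\operatorname{vec}(A^{-1\prime}B'C'A^{-1\prime})\right]'
\end{aligned}
$$

by Rule (2) of Section A.12.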

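(16) Let $A = (a_{ij})$ be an $(m \times m)$ matrix. Then

$$\frac{\partial |A|}{\partial A} = (A^{adj})',$$

where $A^{adj}$ is the adjoint of $A$.

Proof: Developing by the $i$-th row of $A$ gives

$$|A| = a_{i1}A_{i1} + \cdots + a_{im}A_{im},$$

where $A_{ij}$ is the cofactor of $a_{ij}$. Hence,

$$\frac{\partial |A|}{\partial a_{ij}} = A_{ij}$$

because $A_{ij}$ does not contain $a_{ij}$.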


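(17) If $A$ is a nonsingular $(m \times m)$ matrix with $|A| > 0$, then

$$\frac{\partial \ln|A|}{\partial A} = (A')^{-1}.$$

Proof: Using Proposition A.1 (chain rule),

$$\frac{\partial \ln|A|}{\partial A} = \frac{\partial \ln|A|}{\partial |A|}\cdot\frac{\partial |A|}{\partial A}
= \frac{1}{|A|}(A^{adj})' = (A')^{-1}.$$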
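The determinant rules (16) and (17) can be illustrated with a small numerical experiment. This is a sketch only, assuming NumPy; `numerical_matrix_gradient` and the $2 \times 2$ example matrix are hypothetical. The matrix of cofactors, that is, the transpose of the adjoint, is obtained here as $|A|(A')^{-1}$.

```python
import numpy as np

def numerical_matrix_gradient(f, A, eps=1e-6):
    """Central finite-difference derivative of a scalar f with respect to each a_ij."""
    G = np.zeros_like(A, dtype=float)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A, dtype=float)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

A = np.array([[2.0, 0.5],
              [0.1, 1.5]])                            # det(A) = 2.95 > 0

grad_det = numerical_matrix_gradient(np.linalg.det, A)
grad_logdet = numerical_matrix_gradient(lambda M: np.log(np.linalg.det(M)), A)

cofactors = np.linalg.det(A) * np.linalg.inv(A).T     # transpose of the adjoint, Rule (16)
inverse_transpose = np.linalg.inv(A).T                # Rule (17)
```

Here `grad_det` should match `cofactors` and `grad_logdet` should match `inverse_transpose` up to finite-difference error.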
Proposition A.3 (Taylor's Theorem)

Let $f(\beta)$ be a scalar valued function of the $(m \times 1)$ vector $\beta$. Suppose $f(\beta)$ is
at least twice continuously differentiable on an open set $S$ that contains $\beta_0$, $\beta$,
and the entire line segment between $\beta_0$ and $\beta$. Then there exists a point $\bar\beta$ on
the line segment such that

$$f(\beta) = f(\beta_0) + \frac{\partial f(\beta_0)}{\partial\beta'}(\beta - \beta_0)
+ \frac{1}{2}(\beta - \beta_0)'\frac{\partial^2 f(\bar\beta)}{\partial\beta\,\partial\beta'}(\beta - \beta_0), \tag{A.13.1}$$

where $\partial f(\beta_0)/\partial\beta' := (\partial f/\partial\beta'|_{\beta_0})$. ■

The expansion of $f$ given in (A.13.1) is a second order Taylor expansion
at or around $\beta_0$.
Suppose $f(\beta)$ is a real valued (scalar) differentiable function of the $(m \times 1)$
vector $\beta$. A necessary condition for a local optimum (minimum or maximum)
at $\tilde\beta$ is that

$$\frac{\partial f}{\partial\beta} = 0 \quad\text{for } \beta = \tilde\beta, \quad\text{that is,}\quad
\frac{\partial f(\tilde\beta)}{\partial\beta} := \left[\frac{\partial f}{\partial\beta}\bigg|_{\tilde\beta}\right] = 0.$$

In other words, $f(\cdot)$ has a stationary point at $\tilde\beta$. If this condition is satisfied
and the Hessian matrix of second order partial derivatives

$$\frac{\partial^2 f}{\partial\beta\,\partial\beta'}$$

is negative (positive) definite for $\beta = \tilde\beta$, then $\tilde\beta$ is a local maximum (minimum).

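If a set of constraints is given in the form

$$\varphi(\beta) = (\varphi_1(\beta), \dots, \varphi_n(\beta))' = 0,$$

that is, $\varphi(\beta)$ is an $(n \times 1)$ vector, then a local optimum, subject to these
constraints, is obtained at a stationary point of the Lagrange function

$$\mathcal{L}(\beta, \lambda) = f(\beta) - \lambda'\varphi(\beta),$$

where $\lambda$ is an $(n \times 1)$ vector of Lagrange multipliers. In other words, a necessary
condition for a constrained local optimum is that

$$\frac{\partial\mathcal{L}}{\partial\beta} = 0 \quad\text{and}\quad \frac{\partial\mathcal{L}}{\partial\lambda} = 0$$

hold simultaneously.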

The  following  results  are  useful  in  some  optimization  problems. 
Proposition A.4 (Maximum of $\operatorname{tr}(B'\Omega B)$)

Let $\Omega$ be a positive semidefinite symmetric $(K \times K)$ matrix with eigenvalues
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K$ and corresponding orthonormal $(K \times 1)$ eigenvectors
$v_1, v_2, \dots, v_K$. Moreover, let $B$ be a $(K \times r)$ matrix with $B'B = I_r$. Then the
maximum of $\operatorname{tr}(B'\Omega B)$ with respect to $B$ is obtained for

$$B = \hat{B} = [v_1, \dots, v_r]$$

and

$$\max_B \operatorname{tr}(B'\Omega B) = \lambda_1 + \cdots + \lambda_r.$$


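Proof: The proposition follows from Theorem 6, p. 205, of Magnus & Neudecker
(1988) by induction. For $r = 1$, our result is just a special case of that
theorem. For $r > 1$, assuming that the proposition holds for $r - 1$ and denoting
the columns of $B$ by $b_1, \dots, b_r$,

$$
\begin{aligned}
\operatorname{tr}(B'\Omega B)
&= \operatorname{tr}\left(\begin{bmatrix} b_1' \\ \vdots \\ b_r' \end{bmatrix}\Omega\,[b_1, \dots, b_r]\right)
= \operatorname{tr}\begin{bmatrix} b_1'\Omega b_1 & & * \\ & \ddots & \\ * & & b_r'\Omega b_r \end{bmatrix} \\
&= b_1'\Omega b_1 + \cdots + b_{r-1}'\Omega b_{r-1} + b_r'\Omega b_r \\
&\leq \lambda_1 + \cdots + \lambda_{r-1} + b_r'\Omega b_r
\end{aligned}
$$

and $\max b_r'\Omega b_r = v_r'\Omega v_r = \lambda_r$, under the conditions of the proposition, by the
aforementioned theorem from Magnus & Neudecker (1988). ■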
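Proposition A.4 can be illustrated numerically. The following sketch assumes NumPy; the dimensions, the random matrix $\Omega$ and the helper `random_orthonormal` are made-up illustrations. It evaluates $\operatorname{tr}(B'\Omega B)$ at the leading eigenvectors and at a set of randomly drawn matrices with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(1)
K, r = 5, 2
M = rng.normal(size=(K, K))
Omega = M @ M.T                               # symmetric positive semidefinite

eigvals, eigvecs = np.linalg.eigh(Omega)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder so lambda_1 >= ... >= lambda_K
lam, V = eigvals[order], eigvecs[:, order]

B_hat = V[:, :r]                              # eigenvectors of the r largest eigenvalues
max_trace = np.trace(B_hat.T @ Omega @ B_hat)

def random_orthonormal(K, r, rng):
    """Random (K x r) matrix with orthonormal columns, from a reduced QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(K, r)))
    return Q

random_traces = [
    np.trace(B.T @ Omega @ B)
    for B in (random_orthonormal(K, r, rng) for _ in range(200))
]
```

None of the random candidates should exceed `max_trace`, which in turn should equal the sum of the `r` largest eigenvalues.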

The  next  proposition  may  be  regarded  as  a  corollary  of  Proposition  A. 4. 

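Proposition A.5 (Minimum of $\operatorname{tr}(Y - BCX)'\Sigma_u^{-1}(Y - BCX)$)

Let $Y$, $X$, $\Sigma_u$, $B$, and $C$ be matrices of dimensions $(K \times T)$, $(Kp \times T)$, $(K \times K)$,
$(K \times r)$, and $(r \times Kp)$, respectively, with $\Sigma_u$ positive definite, $\operatorname{rk}(B) =
\operatorname{rk}(C) = r$, $\operatorname{rk}(X) = Kp$, and $\operatorname{rk}(Y) = K$. Then a minimum of

$$\operatorname{tr}[(Y - BCX)'\Sigma_u^{-1}(Y - BCX)] \tag{A.14.1}$$

with respect to $B$ and $C$ is obtained for

$$B = \hat{B} = \Sigma_u^{1/2}\hat{V} \quad\text{and}\quad C = \hat{C} = \hat{V}'\Sigma_u^{-1/2}YX'(XX')^{-1}, \tag{A.14.2}$$

where $\hat{V} = [\hat{v}_1, \dots, \hat{v}_r]$ is the $(K \times r)$ matrix of the orthonormal eigenvectors
corresponding to the $r$ largest eigenvalues of

$$\frac{1}{T}\Sigma_u^{-1/2}YX'(XX')^{-1}XY'\Sigma_u^{-1/2}$$

in nonincreasing order.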
Proof: We first assume $\Sigma_u = I_K$.

$$
\begin{aligned}
\operatorname{tr}[(Y - BCX)'(Y - BCX)]
&= \operatorname{tr}[(Y - BCX)(Y - BCX)'] \\
&= [\operatorname{vec}(Y) - \operatorname{vec}(BCX)]'[\operatorname{vec}(Y) - \operatorname{vec}(BCX)] \\
&= [\operatorname{vec}(Y) - (X' \otimes B)\operatorname{vec}(C)]'[\operatorname{vec}(Y) - (X' \otimes B)\operatorname{vec}(C)].
\end{aligned}
\tag{A.14.3}
$$

A derivation similar to that in Section 3.2.1 shows that this sum of squares is
minimized with respect to $\operatorname{vec}(C)$ when this vector is chosen to be

$$
\begin{aligned}
\operatorname{vec}(\hat{C}) &= [(X \otimes B')(X' \otimes B)]^{-1}(X \otimes B')\operatorname{vec}(Y) \\
&= (XX' \otimes B'B)^{-1}\operatorname{vec}(B'YX') \\
&= \operatorname{vec}[(B'B)^{-1}B'YX'(XX')^{-1}].
\end{aligned}
$$

Because we may normalize the columns of $B$, we choose $B'B = I_r$ without
loss of generality. Hence,

$$\hat{C} = B'YX'(XX')^{-1}. \tag{A.14.4}$$

Substituting for $C$ in (A.14.3) gives

$$
\begin{aligned}
&\operatorname{tr}[(Y - BB'YX'(XX')^{-1}X)(Y - BB'YX'(XX')^{-1}X)'] \\
&\quad= \operatorname{tr}(YY') - \operatorname{tr}(BB'YX'(XX')^{-1}XY') - \operatorname{tr}(YX'(XX')^{-1}XY'BB') \\
&\qquad+ \operatorname{tr}(BB'YX'(XX')^{-1}XX'(XX')^{-1}XY'BB') \\
&\quad= \operatorname{tr}(YY') - \operatorname{tr}(B'YX'(XX')^{-1}XY'B),
\end{aligned}
$$

where again $B'B = I_r$ has been used. This expression is minimized with
respect to $B$, where

$$\frac{1}{T}\operatorname{tr}\,B'YX'(XX')^{-1}XY'B$$

assumes its maximum. By Proposition A.4, the maximum is attained if $B$
consists of the eigenvectors corresponding to the $r$ largest eigenvalues of

$$\frac{1}{T}YX'(XX')^{-1}XY',$$

which proves the proposition for $\Sigma_u = I_K$.
If $\Sigma_u \neq I_K$,

$$\operatorname{tr}[(Y - BCX)'\Sigma_u^{-1}(Y - BCX)] = \operatorname{tr}[(Y^* - B^*CX)'(Y^* - B^*CX)]$$

has to be minimized with respect to $B^*$ and $C$. Here $Y^* = \Sigma_u^{-1/2}Y$ and
$B^* = \Sigma_u^{-1/2}B$. From the above derivation the solution is $\hat{B}^* = \hat{V}$ and

$$\hat{C} = \hat{B}^{*\prime}Y^*X'(XX')^{-1} = \hat{V}'\Sigma_u^{-1/2}YX'(XX')^{-1},$$

where the columns of $\hat{V}$ are the eigenvectors corresponding to the $r$ largest
eigenvalues of

$$\frac{1}{T}Y^*X'(XX')^{-1}XY^{*\prime} = \frac{1}{T}\Sigma_u^{-1/2}YX'(XX')^{-1}XY'\Sigma_u^{-1/2}.$$

Hence, $\hat{B} = \Sigma_u^{1/2}\hat{B}^* = \Sigma_u^{1/2}\hat{V}$. ■
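The construction in Proposition A.5 can be traced in a few lines of code. This is a minimal sketch under stated assumptions: NumPy is available, the data are simulated, and the helper names (`sym_power`, `objective`) and all dimensions are hypothetical. It computes $\hat{B}$ and $\hat{C}$ from (A.14.2) and compares the attained value of (A.14.1) with the closed form implied by the proof and with randomly drawn candidates.

```python
import numpy as np

rng = np.random.default_rng(2)
K, Kp, T, r = 3, 4, 60, 2
X = rng.normal(size=(Kp, T))
Y = rng.normal(size=(K, Kp)) @ X + rng.normal(size=(K, T))
S = rng.normal(size=(K, K))
Sigma_u = S @ S.T + K * np.eye(K)             # positive definite weighting matrix

def sym_power(M, p):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

def objective(B, C):
    """tr[(Y - BCX)' Sigma_u^{-1} (Y - BCX)], the criterion (A.14.1)."""
    R = Y - B @ C @ X
    return np.trace(R.T @ np.linalg.inv(Sigma_u) @ R)

Sig_half, Sig_mhalf = sym_power(Sigma_u, 0.5), sym_power(Sigma_u, -0.5)
P_X = X.T @ np.linalg.inv(X @ X.T) @ X        # X'(XX')^{-1}X
G = Sig_mhalf @ Y @ P_X @ Y.T @ Sig_mhalf / T
w, Q = np.linalg.eigh(G)
idx = np.argsort(w)[::-1][:r]                 # indices of the r largest eigenvalues
V_hat = Q[:, idx]

B_hat = Sig_half @ V_hat
C_hat = V_hat.T @ Sig_mhalf @ Y @ X.T @ np.linalg.inv(X @ X.T)
minimum = objective(B_hat, C_hat)

Y_star = Sig_mhalf @ Y
closed_form = np.trace(Y_star @ Y_star.T) - T * w[idx].sum()
random_values = [
    objective(rng.normal(size=(K, r)), rng.normal(size=(r, Kp)))
    for _ in range(200)
]
```

The symmetric square root is only one admissible choice of $\Sigma_u^{1/2}$; it is used here because it is easy to compute from the eigendecomposition.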

A  result  similar  to  that  in  Proposition  A. 4  also  holds  for  the  maximum  and 
minimum  of  a  determinant.  The  following  proposition  is  a  slight  modification 
of  Theorem  15  of  Magnus  &  Neudecker  (1988,  Chapter  11). 

Proposition A.6 (Maximum and Minimum of $|C\Omega C'|$)

Let $\Omega$ be a positive definite symmetric $(K \times K)$ matrix with eigenvalues
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_K$ and corresponding orthonormal $(K \times 1)$ eigenvectors
$v_1, \dots, v_K$. Furthermore, let $C$ be an $(r \times K)$ matrix with $CC' = I_r$. Then

$$\max_C |C\Omega C'| = \lambda_1 \cdots \lambda_r$$

and the maximum is attained for

$$C = \hat{C} = [v_1, \dots, v_r]'.$$

Moreover,

$$\min_C |C\Omega C'| = \lambda_K\lambda_{K-1}\cdots\lambda_{K-r+1}$$

and the minimum is attained for

$$C = \hat{C} = [v_K, \dots, v_{K-r+1}]'.$$


An  important  implication  of  this  proposition  is  used  in  Chapter  7  and  is  stated 
next. 

Proposition A.7 (Minimum of $|T^{-1}(Y - BCX)(Y - BCX)'|$)

Let $Y$ and $X$ be $(K \times T)$ matrices of rank $K$ and let $B$ and $C$ be of rank $r$ and
dimensions $(K \times r)$ and $(r \times K)$, respectively. Furthermore, let $\lambda_1 \geq \cdots \geq \lambda_K$
be the eigenvalues of

$$(XX')^{-1/2}XY'(YY')^{-1}YX'(XX')^{-1/2\prime}$$

and the corresponding orthonormal eigenvectors are $v_1, \dots, v_K$. Here $(XX')^{-1/2}$
is some matrix satisfying

$$(XX')^{-1/2}(XX')(XX')^{-1/2\prime} = I_K.$$

Then

$$\min_{B,C}|T^{-1}(Y - BCX)(Y - BCX)'| = |T^{-1}YY'|(1 - \lambda_1)\cdots(1 - \lambda_r)$$

and the minimum is attained for

$$C = \hat{C} = [v_1, \dots, v_r]'(XX')^{-1/2}$$

and

$$B = \hat{B} = YX'\hat{C}'(\hat{C}XX'\hat{C}')^{-1}.$$

A proof of this proposition can be found in Tso (1981). It should be noted
that the minimizing matrices $\hat{B}$ and $\hat{C}$ are not unique. Any nonsingular $(r \times r)$
matrix $F$ leads to another set of minimizing matrices $F\hat{C}$, $\hat{B}F^{-1}$.
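The determinant criterion of Proposition A.7 can also be checked on simulated data. The sketch below assumes NumPy; the symmetric inverse square root is one admissible choice of $(XX')^{-1/2}$, and the data, dimensions and helper names (`inv_sqrt`, `criterion`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
K, T, r = 3, 80, 2
X = rng.normal(size=(K, T))
Y = 0.5 * rng.normal(size=(K, K)) @ X + rng.normal(size=(K, T))

def inv_sqrt(M):
    """Symmetric inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return (Q / np.sqrt(w)) @ Q.T

XX_mhalf = inv_sqrt(X @ X.T)
S = XX_mhalf @ X @ Y.T @ np.linalg.inv(Y @ Y.T) @ Y @ X.T @ XX_mhalf.T
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]                 # lambda_1 >= ... >= lambda_K
lam, V = lam[order], V[:, order]

C_hat = V[:, :r].T @ XX_mhalf
B_hat = Y @ X.T @ C_hat.T @ np.linalg.inv(C_hat @ X @ X.T @ C_hat.T)

def criterion(B, C):
    """|T^{-1}(Y - BCX)(Y - BCX)'|."""
    R = Y - B @ C @ X
    return np.linalg.det(R @ R.T / T)

attained = criterion(B_hat, C_hat)
formula = np.linalg.det(Y @ Y.T / T) * np.prod(1.0 - lam[:r])

# Non-uniqueness: F*C_hat and B_hat*F^{-1} give the same product B*C.
F = np.array([[2.0, 1.0],
              [0.0, 3.0]])
rotated = criterion(B_hat @ np.linalg.inv(F), F @ C_hat)

random_values = [
    criterion(rng.normal(size=(K, r)), rng.normal(size=(r, K)))
    for _ in range(200)
]
```

The value `attained` should coincide with `formula`, and `rotated` shows that a nonsingular transformation `F` leaves the criterion unchanged.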


A.15 Problems


The  following  problems  refer  to  the  matrices 


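$$A = \begin{bmatrix} 5 & 2 \\ -1 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 6 & 0 & 0 \\ -6 & 1 & 0 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 4 & 0 \\ 2 & 2 & 2 \\ 1 & 2 & 0 \end{bmatrix},$$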


D  = 


$$H(\beta) = \begin{bmatrix} 4\beta_1 & 2\beta_1 + \beta_2 \\ 1 + \beta_2 & 3 \end{bmatrix}.$$


Problem A.1

Determine $A + D$, $A - 2D$, $A'$, $AB$, $BC$, $B'A$, $B'A'$, $A \otimes D$, $B \otimes D$, $D \otimes B$,
$B' \otimes D'$, $B + BC$, $\operatorname{tr} A$, $\operatorname{tr} D$, $\det A$, $|D|$, $|C|$, $\operatorname{vec}(B)$, $\operatorname{vec}(B')$, $\operatorname{vech}(C')$, $K_{33}$,
$A^{-1}$, $D^{-1}$, $(A \otimes D)^{-1}$, $\operatorname{rk}(C)$, $\operatorname{rk}(D)$, $\det(A \otimes D)$, $\operatorname{tr}(A \otimes D)$, $C^{-1}$ (use the
rules for the partitioned inverse).


Problem A.2

Determine the eigenvalues of $A$, $D$, and $A \otimes D$.


Problem A.3

Find an upper triangular matrix $Q$ such that $D = QQ'$ and find an orthogonal
matrix $P$ such that $D = P\Lambda P'$, where $\Lambda$ is a diagonal matrix with the
eigenvalues of $D$ on the main diagonal. Compute $D^5$.


Problem A.4

Is $F = I_2 - BB'$ idempotent? Is $BB'$ positive definite?

Problem A.5

Determine the following derivatives.

$$\frac{\partial\det(H)}{\partial\beta}, \quad \frac{\partial^2\det(H)}{\partial\beta\,\partial\beta'}, \quad \frac{\partial\operatorname{tr}(H)}{\partial\beta},$$

$$\frac{\partial\operatorname{vec}(H)}{\partial\beta'}, \quad \frac{\partial\operatorname{vec}(H^2)}{\partial\beta'}, \quad \frac{\partial H(\beta)\beta}{\partial\beta'},$$

where $\beta = (\beta_1, \beta_2)'$.

Problem A.6

Determine the stationary points of $|H|$ with respect to $\beta$. Are they local
extrema?

Problem A.7

Give a second order Taylor expansion of $\det(H)$ around $\beta = (0, 0)'$.


B 


Multivariate  Normal  and  Related  Distributions 


B.1 Multivariate Normal Distributions

A $K$-dimensional vector of continuous random variables $y = (y_1, \dots, y_K)'$ has
a multivariate normal distribution with mean vector $\mu = (\mu_1, \dots, \mu_K)'$ and
covariance matrix $\Sigma$, briefly

$$y \sim \mathcal{N}(\mu, \Sigma),$$

if its distribution has the probability density function (p.d.f.)

$$f(y) = (2\pi)^{-K/2}|\Sigma|^{-1/2}\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right]. \tag{B.1.1}$$

Alternatively, $y \sim \mathcal{N}(\mu, \Sigma)$, if for any $K$-vector $c$ for which $c'\Sigma c \neq 0$ the
linear combination $c'y$ has a univariate normal distribution, that is, $c'y \sim
\mathcal{N}(c'\mu, c'\Sigma c)$ (see Rao (1973, Chapter 8)). This definition of a multivariate
normal distribution is useful because it carries over to the case where $\Sigma$ is
positive semidefinite and singular, while the multivariate density in (B.1.1) is
only meaningful if $\Sigma$ is positive definite and, hence, nonsingular. It must be
emphasized, however, that the two definitions are equivalent if $\Sigma$ is positive
definite rather than just positive semidefinite. Another possibility to define
a multivariate normal distribution with singular covariance matrix may be
found in Anderson (1984).

The following results regarding the multivariate normal and related distributions
are useful. Many of them are stated in Judge et al. (1985, Appendix
A). Proofs can be found in Rao (1973, Chapter 8) and Hogg & Craig (1978,
Chapter 12).

Proposition B.1 (Marginal and Conditional Distributions of a Multivariate
Normal)

Let $y_1$ and $y_2$ be two random vectors such that

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right),$$

where the partitioning of the mean vector and covariance matrix corresponds
to that of the vector $(y_1', y_2')'$. Then,

$$y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$$

and the conditional distribution of $y_1$ given $y_2 = c$ is also multivariate normal,

$$(y_1 | y_2 = c) \sim \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(c - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$

If $\Sigma_{22}$ is singular, the inverse can be replaced by a generalized inverse. Moreover,
$y_1$ and $y_2$ are independent if and only if $\Sigma_{12} = \Sigma_{21}' = 0$. ■
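The conditional moments in Proposition B.1 are straightforward to compute. The following sketch assumes NumPy; the numerical values of $\mu$ and $\Sigma$, the partition and the function name `conditional_normal` are made up for illustration. As a cross-check it uses the standard identity that the conditional covariance equals the inverse of the corresponding block of $\Sigma^{-1}$.

```python
import numpy as np

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])        # positive definite
n1 = 1                                     # y1 = first component, y2 = remaining two

mu1, mu2 = mu[:n1], mu[n1:]
S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]

def conditional_normal(c):
    """Mean and covariance of y1 given y2 = c, following Proposition B.1."""
    W = S12 @ np.linalg.inv(S22)
    return mu1 + W @ (c - mu2), S11 - W @ S21

cond_mean, cond_cov = conditional_normal(np.array([2.5, -1.0]))

# Cross-check: the conditional covariance is the inverse of the (1,1) block of Sigma^{-1}.
precision_block = np.linalg.inv(Sigma)[:n1, :n1]
```

Conditioning on $y_2 = \mu_2$ leaves the mean of $y_1$ at $\mu_1$, which gives a quick sanity check of the implementation.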

Proposition B.2 (Linear Transformation of a Multivariate Normal Random
Vector)

Suppose $y \sim \mathcal{N}(\mu, \Sigma)$ is $(K \times 1)$, $A$ is an $(M \times K)$ matrix and $c$ an $(M \times 1)$
vector. Then

$$x = Ay + c \sim \mathcal{N}(A\mu + c, A\Sigma A').$$


B.2 Related Distributions

Suppose $y \sim \mathcal{N}(0, I_K)$. The distribution of $z = y'y$ is a (central) chi-square
distribution with $K$ degrees of freedom,

$$z \sim \chi^2(K).$$

Proposition B.3 (Distributions of Quadratic Forms)

(1) Suppose $y \sim \mathcal{N}(0, I_K)$ and $A$ is a symmetric idempotent $(K \times K)$ matrix
with $\operatorname{rk}(A) = n$. Then $y'Ay \sim \chi^2(n)$.

(2) If $y \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ is a positive definite $(K \times K)$ matrix, then
$y'\Sigma^{-1}y \sim \chi^2(K)$.

(3) Let $y \sim \mathcal{N}(0, QA)$, where $Q$ is a symmetric idempotent $(K \times K)$ matrix
with $\operatorname{rk}(Q) = n$ and $A$ is a positive definite $(K \times K)$ matrix. Then
$y'A^{-1}y \sim \chi^2(n)$.

(4) Suppose $y \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ is a nonsingular $(K \times K)$ covariance
matrix. Furthermore, let $A$ be a $(K \times K)$ matrix with $\operatorname{rk}(A) = n$. Then

$$y'Ay \sim \chi^2(n) \;\Rightarrow\; A\Sigma A = A$$

and

$$A\Sigma A = A \;\Rightarrow\; y'Ay \sim \chi^2(n).$$
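Part (1) of Proposition B.3 can be illustrated by simulation. This is a rough sketch, assuming NumPy; the projection matrix, the seed and the number of draws are arbitrary choices. It compares the sample mean and variance of $y'Ay$ with the values $n$ and $2n$ of a $\chi^2(n)$ distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
K, n, draws = 5, 2, 20000

Z = rng.normal(size=(K, n))
A = Z @ np.linalg.inv(Z.T @ Z) @ Z.T        # symmetric idempotent projection of rank n

y = rng.normal(size=(draws, K))             # each row is one draw from N(0, I_K)
q = np.einsum("ti,ij,tj->t", y, A, y)       # quadratic forms y'Ay, one per draw

sample_mean, sample_var = q.mean(), q.var()
rank_A = int(round(np.trace(A)))            # for an idempotent matrix, rank = trace
```

A simulation of this kind only suggests agreement of the first two moments; it is not a substitute for the proof cited in the text.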
Proposition B.4 (Independence of a Normal Vector and a Quadratic Form)

Suppose $y \sim \mathcal{N}(\mu, \sigma^2I_K)$, $A$ is a symmetric, idempotent $(K \times K)$ matrix, $B$
is an $(M \times K)$ matrix and $BA = 0$. Then $By$ is stochastically independent of
the random variable $y'Ay$. ■

Proposition B.5 (Independence of Quadratic Forms)

Suppose $y \sim \mathcal{N}(\mu, \sigma^2I_K)$ and $A$ and $B$ are symmetric, idempotent $(K \times K)$
matrices with $AB = 0$, then $y'Ay$ and $y'By$ are stochastically independent. ■

If $z \sim \mathcal{N}(0, 1)$ and $u \sim \chi^2(m)$ are stochastically independent, then

$$T = \frac{z}{\sqrt{u/m}}$$

has a $t$-distribution with $m$ degrees of freedom, $T \sim t(m)$. If $u \sim \chi^2(m)$ and
$v \sim \chi^2(n)$ are independent, then

$$\frac{u/m}{v/n} \sim F(m, n),$$

that is, the ratio of two independent $\chi^2$ random variables, each divided by its
degrees of freedom, has an $F$-distribution with $m$ and $n$ degrees of freedom.
The numbers $m$ and $n$ indicate the numerator and denominator degrees of
freedom, respectively.

Proposition B.6 (Distributions of Ratios of Quadratic Forms)

(1) Suppose $x \sim \mathcal{N}(0, I_m)$ and $y \sim \mathcal{N}(0, I_n)$ are independent. Then

$$\frac{x'x/m}{y'y/n} \sim F(m, n).$$

(2) If $y \sim \mathcal{N}(0, I_K)$ and $A$ and $B$ are symmetric, idempotent $(K \times K)$ matrices
with $\operatorname{rk}(A) = m$, $\operatorname{rk}(B) = n$ and $AB = 0$, then

$$\frac{y'Ay/m}{y'By/n} \sim F(m, n).$$

(3) $z \sim F(m, n) \;\Rightarrow\; \dfrac{1}{z} \sim F(n, m)$.

If $y \sim \mathcal{N}(\mu, I_K)$, then $y'y$ has a noncentral $\chi^2$-distribution with $K$ degrees
of freedom and noncentrality parameter (or simply noncentrality) $\tau = \mu'\mu$.
Briefly,

$$y'y \sim \chi^2(K; \tau).$$

The noncentrality parameter is sometimes defined differently in the literature.
For instance, $\lambda = \tfrac{1}{2}\mu'\mu$ is sometimes called noncentrality parameter. Let $w \sim
\chi^2(m; \tau)$ and $v \sim \chi^2(n)$ be independent random variables, then

$$\frac{w/m}{v/n} \sim F(m, n; \tau),$$

that is, the ratio has a noncentral $F$-distribution with $m$ and $n$ degrees of
freedom and noncentrality parameter $\tau$.

Proposition B.7 (Quadratic Form with Noncentral $\chi^2$-Distribution)

If $y \sim \mathcal{N}(\mu, \Sigma)$ with positive definite $(K \times K)$ covariance matrix $\Sigma$, then
$y'\Sigma^{-1}y \sim \chi^2(K; \mu'\Sigma^{-1}\mu)$. ■


C

Stochastic Convergence and Asymptotic
Distributions


It is often difficult to derive the exact distributions of estimators and test
statistics. In that case, their asymptotic or limiting properties, when the sample
size gets large, are of interest. The limiting properties are then regarded
as approximations to the properties for the sample size available. In order to
study the limiting properties, some concepts of convergence of sequences of
random variables and vectors are useful. They are discussed in Sections C.1
and C.2. Infinite sums of random variables are treated in Section C.3. Laws of
large numbers and central limit theorems are given in Section C.4. Asymptotic
properties of estimators are considered in Section C.5. Maximum likelihood
estimators and their asymptotic properties are discussed in Section C.6 and
some common testing principles are treated in Section C.7. Finally, asymptotic
properties of nonstationary processes with unit roots are dealt with in
Section C.8.

This appendix contains a brief summary of results used in the text. Many
of these results can be found in Judge et al. (1985, Section 5.8). A more
complete discussion and proofs are provided in Fuller (1976), Roussas (1973),
Serfling (1980), Davidson (1994, 2000) and other more advanced books on
statistics. Further references will be given in the following.


C.1 Concepts of Stochastic Convergence

Let $x_1, x_2, \dots$ or $\{x_T\}$, $T = 1, 2, \dots$, be a sequence of scalar random variables
which are all defined on a common probability space $(\Omega, \mathcal{F}, \Pr)$. The sequence
$\{x_T\}$ converges in probability to the random variable $x$ (which is also defined
on $(\Omega, \mathcal{F}, \Pr)$) if for every $\epsilon > 0$,

$$\lim_{T\to\infty}\Pr(|x_T - x| > \epsilon) = 0$$

or, equivalently,

$$\lim_{T\to\infty}\Pr(|x_T - x| < \epsilon) = 1.$$

This type of stochastic convergence is abbreviated as

$$\operatorname{plim} x_T = x \quad\text{or}\quad x_T \xrightarrow{p} x.$$

The limit $x$ may be a fixed, nonstochastic real number which is then regarded
as a degenerate random variable that takes on one particular value with
probability one.
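Convergence in probability to a constant can be made concrete with a small Monte Carlo experiment. The sketch below uses only the Python standard library; the function name `exceedance_frequency`, the choice of uniform draws and the tolerance $\epsilon = 0.1$ are illustrative assumptions. It estimates $\Pr(|x_T - x| > \epsilon)$ for the sample mean $x_T$ of $T$ independent uniform(0, 1) variables and $x = 0.5$, relying on the law of large numbers for the claim that the probability limit is 0.5.

```python
import random

def exceedance_frequency(T, eps=0.1, replications=500, seed=0):
    """Share of replications in which |sample mean - 0.5| > eps for T uniform(0,1) draws."""
    rng = random.Random(seed)
    count = 0
    for _ in range(replications):
        mean_T = sum(rng.random() for _ in range(T)) / T
        if abs(mean_T - 0.5) > eps:
            count += 1
    return count / replications

# Estimated exceedance probabilities for increasing sample sizes.
frequencies = {T: exceedance_frequency(T) for T in (10, 100, 1000)}
```

The estimated frequencies should shrink toward zero as `T` grows, which is what the definition requires for every fixed `eps`.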

The sequence $\{x_T\}$ converges almost surely (a.s.) or with probability one
to the random variable $x$ if for every $\epsilon > 0$,

$$\Pr\left(\lim_{T\to\infty}|x_T - x| < \epsilon\right) = 1.$$

This type of stochastic convergence is often written as $x_T \xrightarrow{a.s.} x$ and is sometimes
called strong convergence.

The sequence $\{x_T\}$ converges in quadratic mean or mean square error to
$x$, briefly $x_T \xrightarrow{q.m.} x$, if

$$\lim_{T\to\infty}E(x_T - x)^2 = 0.$$

This type of convergence requires that the mean and variance of the $x_T$'s and
$x$ exist.

Finally, denoting the distribution functions of $x_T$ and $x$ by $F_T$ and $F$,
respectively, the sequence $\{x_T\}$ is said to converge in distribution or weakly
or in law to $x$, if for all real numbers $c$ for which $F$ is continuous,

$$\lim_{T\to\infty}F_T(c) = F(c).$$

This type of convergence is abbreviated as $x_T \xrightarrow{d} x$. It must be emphasized
that we do not require the convergence of the sequence of p.d.f.s of the $x_T$'s
to the p.d.f. of $x$. In fact, we do not even require that the distributions of
the $x_T$'s have p.d.f.s. Even if they do have p.d.f.s, convergence in distribution
does not imply their convergence to the p.d.f. of $x$.

All these concepts of stochastic convergence can be extended to sequences
of random vectors (multivariate random variables). Suppose $\{x_T =
(x_{1T}, \dots, x_{KT})'\}$, $T = 1, 2, \dots$, is a sequence of $K$-dimensional random vectors
and $x = (x_1, \dots, x_K)'$ is a $K$-dimensional random vector. Then the following
definitions are used:

$\operatorname{plim} x_T = x$ or $x_T \xrightarrow{p} x$ if $\operatorname{plim} x_{kT} = x_k$ for $k = 1, \dots, K$.

$x_T \xrightarrow{a.s.} x$ if $x_{kT} \xrightarrow{a.s.} x_k$ for $k = 1, \dots, K$.

$x_T \xrightarrow{q.m.} x$ if $\lim E[(x_T - x)'(x_T - x)] = 0$.

$x_T \xrightarrow{d} x$ if $\lim F_T(c) = F(c)$ for all continuity points of $F$.
Here $F_T$ and $F$ are the joint distribution functions of $x_T$ and $x$, respectively.
Almost sure convergence and convergence in probability can be defined for
matrices in the same way in terms of convergence of the individual elements.
Convergence in quadratic mean and in distribution is easily extended to sequences
of random matrices by vectorizing them. In the following proposition,
the relationships between the different modes of convergence are given.

Proposition C.1 (Convergence Properties of Sequences of Random Variables)

Suppose $\{x_T\}$ is a sequence of $K$-dimensional random variables. Then the
following relations hold:

(1) $x_T \xrightarrow{a.s.} x \;\Rightarrow\; x_T \xrightarrow{p} x \;\Rightarrow\; x_T \xrightarrow{d} x$.

(2) $x_T \xrightarrow{q.m.} x \;\Rightarrow\; x_T \xrightarrow{p} x \;\Rightarrow\; x_T \xrightarrow{d} x$.

(3) If $x$ is a fixed, nonstochastic vector, then

$$x_T \xrightarrow{q.m.} x \;\Leftrightarrow\; \left[\lim E(x_T) = x \text{ and } \lim E\{(x_T - Ex_T)'(x_T - Ex_T)\} = 0\right].$$

(4) If $x$ is a fixed, nonstochastic random vector, then

$$x_T \xrightarrow{p} x \;\Leftrightarrow\; x_T \xrightarrow{d} x.$$

(5) (Slutsky's Theorem) If $g : \mathbb{R}^K \to \mathbb{R}^m$ is a continuous function, then

$$x_T \xrightarrow{p} x \;\Rightarrow\; g(x_T) \xrightarrow{p} g(x) \quad [\operatorname{plim} g(x_T) = g(\operatorname{plim} x_T)],$$

$$x_T \xrightarrow{d} x \;\Rightarrow\; g(x_T) \xrightarrow{d} g(x),$$

and

$$x_T \xrightarrow{a.s.} x \;\Rightarrow\; g(x_T) \xrightarrow{a.s.} g(x).$$


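Proposition C.2 (Properties of Convergence in Probability and in Distribution)

Suppose $\{x_T\}$ and $\{y_T\}$ are sequences of $(K \times 1)$ random vectors, $\{A_T\}$ is a
sequence of $(K \times K)$ random matrices, $x$ is a $(K \times 1)$ random vector, $c$ is a
fixed $(K \times 1)$ vector, and $A$ is a fixed $(K \times K)$ matrix.

(1) If $\operatorname{plim} x_T$, $\operatorname{plim} y_T$, and $\operatorname{plim} A_T$ exist, then

(a) $\operatorname{plim}(x_T \pm y_T) = \operatorname{plim} x_T \pm \operatorname{plim} y_T$;

(b) $\operatorname{plim}(c'x_T) = c'(\operatorname{plim} x_T)$;

(c) $\operatorname{plim} x_T'y_T = (\operatorname{plim} x_T)'(\operatorname{plim} y_T)$;

(d) $\operatorname{plim} A_Tx_T = \operatorname{plim}(A_T)\operatorname{plim}(x_T)$.

(2) If $x_T \xrightarrow{d} x$ and $\operatorname{plim}(x_T - y_T) = 0$, then $y_T \xrightarrow{d} x$.

(3) If $x_T \xrightarrow{d} x$ and $\operatorname{plim} y_T = c$, then

(a) $x_T \pm y_T \xrightarrow{d} x \pm c$;

(b) $y_T'x_T \xrightarrow{d} c'x$.

(4) If $x_T \xrightarrow{d} x$ and $\operatorname{plim} A_T = A$, then $A_Tx_T \xrightarrow{d} Ax$.

(5) If $x_T \xrightarrow{d} x$ and $\operatorname{plim} A_T = 0$, then $\operatorname{plim} A_Tx_T = 0$.

■

Proposition C.3 (Limits of Sequences of t and F Random Variables)

(1) $t(T) \xrightarrow[T\to\infty]{d} \mathcal{N}(0, 1)$

(that is, a sequence of random variables with $t$-distributions with $T$ degrees
of freedom converges to a standard normal distribution as the degrees of
freedom go to infinity).

(2) $J\,F(J, T) \xrightarrow[T\to\infty]{d} \chi^2(J)$.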


C.2 Order in Probability

Let $\{a_T\}$ be a sequence of real numbers and $\{b_T\}$ a sequence of positive real
numbers. Then $a_T$ is said to be of smaller order than $b_T$ ($a_T = o(b_T)$) if
$\lim_{T\to\infty} a_T/b_T = 0$ and $a_T$ is said to be at most of order $b_T$ ($a_T = O(b_T)$) if
there exists a number $c$ such that for all $T$, $|a_T|/b_T \leq c$.

Proposition C.4 (Order of Convergence Results)

For sequences of real numbers $\{a_T\}$, $\{b_T\}$ and sequences of positive real numbers
$\{c_T\}$, $\{d_T\}$, the following results hold:

(1) $a_T = o(c_T)$, $b_T = o(d_T)$ $\Rightarrow$ $a_Tb_T = o(c_Td_T)$, $a_T + b_T = o(\max[c_T, d_T])$
and $|a_T|^s = o(c_T^s)$ for $s > 0$.

(2) $a_T = O(c_T)$, $b_T = O(d_T)$ $\Rightarrow$ $a_Tb_T = O(c_Td_T)$, $a_T + b_T = O(\max[c_T, d_T])$
and $|a_T|^s = O(c_T^s)$ for $s > 0$.

(3) $a_T = o(c_T)$, $b_T = O(d_T)$ $\Rightarrow$ $a_Tb_T = o(c_Td_T)$.

Let $\{A_T = (a_{ij,T})\}$ be a sequence of random $(m \times n)$ matrices and $\{b_T\}$
a sequence of positive real numbers. Then $A_T$ is said to be of smaller order
in probability than $b_T$ ($A_T = o_p(b_T)$) if $\operatorname{plim}_{T\to\infty} A_T/b_T = 0$ and $A_T$ is
said to be at most of order in probability $b_T$ or bounded in probability by $b_T$
($A_T = O_p(b_T)$) if, for every $\epsilon > 0$, there exists a number $c_\epsilon$ such that for all
$T$, $\Pr\{|a_{ij,T}| \geq c_\epsilon b_T\} \leq \epsilon$ for $i = 1, \dots, m$, $j = 1, \dots, n$. The following results
hold for sequences of random matrices.

Proposition C.5 (Order in Probability Results)

For sequences of random matrices of suitable fixed dimensions $\{A_T\}$, $\{B_T\}$
and sequences of positive real numbers $\{c_T\}$, $\{d_T\}$ the following results hold:

(1) $A_T = o_p(c_T)$, $B_T = o_p(d_T)$ $\Rightarrow$ $A_TB_T = o_p(c_Td_T)$ and $A_T + B_T =
o_p(\max[c_T, d_T])$.

(2) $A_T = O_p(c_T)$, $B_T = O_p(d_T)$ $\Rightarrow$ $A_TB_T = O_p(c_Td_T)$ and $A_T + B_T =
O_p(\max[c_T, d_T])$.

(3) $A_T = o_p(c_T)$, $B_T = O_p(d_T)$ $\Rightarrow$ $A_TB_T = o_p(c_Td_T)$.


For the next result see, e.g., Fuller (1976, p. 192).

Proposition C.6 (Taylor's Theorem for Functions of Random Vectors)

Let $y_T = (y_{1T}, \dots, y_{KT})' = a + O_p(r_T)$ be a $K$-dimensional random vector
sequence, where $r_T = o(1)$, and let $g : \mathbb{R}^K \to \mathbb{R}$ be a function with continuous
partial derivatives of order two at $a = (a_1, \dots, a_K)'$. Then

$$g(y_T) = g(a) + \frac{\partial g(a)}{\partial y'}(y_T - a) + O_p(r_T^2).$$

If $g$ has continuous partial derivatives of order three,

$$g(y_T) = g(a) + \frac{\partial g(a)}{\partial y'}(y_T - a) + \frac{1}{2}(y_T - a)'\frac{\partial^2 g(a)}{\partial y\,\partial y'}(y_T - a) + O_p(r_T^3).$$


C.3 Infinite Sums of Random Variables

The MA representation of a VAR process is often an infinite sum of random
vectors. As in the study of infinite sums of real numbers, we must specify
what we mean by such an infinite sum. The concept of absolute convergence
is basic in the following. A doubly infinite sequence of real numbers $\{a_i\}$,
$i = 0, \pm 1, \pm 2, \dots$, is absolutely summable if

$$\lim_{n\to\infty}\sum_{i=-n}^{n}|a_i|$$

exists and is finite. The limit is usually denoted by

$$\sum_{i=-\infty}^{\infty}|a_i|.$$

The following theorem provides a justification for working with infinite sums
of random variables. A proof may be found in Fuller (1976, pp. 29-31).
Proposition C.7 (Existence of Infinite Sums of Random Variables)

Suppose $\{a_i\}$ is an absolutely summable sequence of real numbers and $\{z_t\}$,
$t = 0, \pm 1, \pm 2, \dots$, is a sequence of random variables satisfying

$$E(z_t^2) \leq c, \quad t = 0, \pm 1, \pm 2, \dots,$$

for some finite constant $c$. Then there exists a sequence of random variables
$\{y_t\}$, $t = 0, \pm 1, \pm 2, \dots$, such that

$$\sum_{i=-n}^{n}a_iz_{t-i} \xrightarrow[n\to\infty]{q.m.} y_t$$

and, thus,

$$\operatorname*{plim}_{n\to\infty}\sum_{i=-n}^{n}a_iz_{t-i} = y_t.$$

The random variables $y_t$ are uniquely determined except on a set of probability
zero. If, in addition, the $z_t$ are independent random variables, then

$$\sum_{i=-n}^{n}a_iz_{t-i} \xrightarrow{a.s.} y_t.$$
This theorem makes precise what we mean by a (univariate) infinite MA

$$y_t = \sum_{i=0}^{\infty}\Phi_iu_{t-i},$$

where $u_t$ is univariate zero mean white noise with variance $\sigma_u^2 < \infty$. Defining
$a_i = 0$ for $i < 0$ and $a_i = \Phi_i$ for $i \geq 0$ and assuming that $\{a_i\}$ is absolutely
summable, the proposition guarantees that the process $y_t$ is uniquely defined
as a limit in mean square, except on a set of probability zero. The latter
qualification may be ignored for practical purposes because we may always
change a random variable on a set of probability zero without changing its
probability characteristics. The requirement for the MA coefficients to be
absolutely summable is satisfied if $y_t$ is a stable AR process. For instance,
if $y_t = \alpha y_{t-1} + u_t$ is an AR(1) process, $\Phi_i = \alpha^i$ which is an absolutely
summable sequence for $|\alpha| < 1$. With respect to the moments of an infinite
sum of random variables the following result holds:
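The AR(1) example can be made concrete in a few lines. The sketch below uses only the Python standard library; the function names, the value $\alpha = 0.8$ and the shock sequence are illustrative assumptions. It shows that the partial sums of $|\alpha^i|$ approach $1/(1 - |\alpha|)$ and that, with a zero pre-sample value, the AR(1) recursion reproduces the MA sum $\sum_{i=0}^{t}\alpha^iu_{t-i}$ over the available shocks.

```python
def ma_coefficients(alpha, n):
    """MA coefficients Phi_i = alpha**i for i = 0, ..., n."""
    return [alpha ** i for i in range(n + 1)]

def ar1_recursion(alpha, shocks):
    """y_t = alpha * y_{t-1} + u_t with pre-sample value y_{-1} = 0."""
    path, previous = [], 0.0
    for u in shocks:
        previous = alpha * previous + u
        path.append(previous)
    return path

def truncated_ma(alpha, shocks, t):
    """sum_{i=0}^{t} alpha**i * u_{t-i}, the MA sum over the shocks observed up to t."""
    return sum(alpha ** i * shocks[t - i] for i in range(t + 1))

alpha = 0.8
shocks = [1.0, -0.5, 0.25, 2.0, -1.0, 0.0, 0.75]

abs_sum = sum(abs(c) for c in ma_coefficients(alpha, 200))   # close to 1 / (1 - 0.8)
ar_path = ar1_recursion(alpha, shocks)
ma_path = [truncated_ma(alpha, shocks, t) for t in range(len(shocks))]
```

For $|\alpha| \geq 1$ the partial sums of $|\alpha^i|$ grow without bound, so the sequence is no longer absolutely summable.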

Proposition C.8 (Moments of Infinite Sums of Random Variables)

Suppose $z_t$ satisfies the conditions of Proposition C.7, $\{a_i\}$ and $\{b_i\}$ are
absolutely summable sequences of real numbers,

$$y_t = \sum_{i=-\infty}^{\infty}a_iz_{t-i}, \quad\text{and}\quad x_t = \sum_{i=-\infty}^{\infty}b_iz_{t-i}.$$

Then

$$E(y_t) = \lim_{n\to\infty}\sum_{i=-n}^{n}a_iE(z_{t-i})$$

and

$$E(y_tx_t) = \lim_{n\to\infty}\sum_{i=-n}^{n}\sum_{j=-n}^{n}a_ib_jE(z_{t-i}z_{t-j})$$

and, in particular,

$$E(y_t^2) = \lim_{n\to\infty}\sum_{i=-n}^{n}\sum_{j=-n}^{n}a_ia_jE(z_{t-i}z_{t-j}).$$

Proof: Fuller (1976, pp. 32-33). ■

All these concepts and results may be extended to vector processes. A
sequence of $(K \times K)$ matrices $\{A_i = (a_{mn,i})\}$, $i = 0, \pm 1, \pm 2, \dots$, is absolutely
summable if each sequence $\{a_{mn,i}\}$, $m, n = 1, \dots, K$, $i = 0, \pm 1, \pm 2, \dots$, is
absolutely summable. Equivalently, $\{A_i\}$ may be defined to be absolutely
summable if the sequence $\{\|A_i\|\}$ is summable, where

$$\|A_i\| := [\operatorname{tr}(A_iA_i')]^{1/2} = \left(\sum_m\sum_n a_{mn,i}^2\right)^{1/2}$$

is the Euclidean norm of $A_i$. To see the equivalence of the two definitions,
note that

$$|a_{mn,i}| \leq \|A_i\| \leq \sum_m\sum_n|a_{mn,i}|.$$

Hence,

$$\sum_{i=-\infty}^{\infty}|a_{mn,i}| \tag{C.3.1}$$

exists and is finite if

$$\sum_{i=-\infty}^{\infty}\|A_i\| \tag{C.3.2}$$

is finite. In turn, if (C.3.1) is finite for all $m, n$, then, for all $h$,

$$\sum_{i=-h}^{h}\|A_i\| \leq \sum_{i=-h}^{h}\sum_m\sum_n|a_{mn,i}|$$

so that (C.3.2) is finite. Thus, the two definitions are indeed equivalent.

Proposition  C.9  ( Existence  of  Infinite  Sums  of  Random  Vectors) 

Suppose $\{A_i\}$ is an absolutely summable sequence of real $(K \times K)$
matrices and $\{z_t\}$ is a sequence of $K$-dimensional random variables
satisfying

\[
E(z_t' z_t) \le c, \quad t = 0, \pm 1, \pm 2, \dots,
\]

for some finite constant $c$. Then there exists a sequence of $K$-dimensional
random variables $\{y_t\}$ such that

\[
\sum_{i=-n}^{n} A_i z_{t-i} \xrightarrow[n\to\infty]{q.m.} y_t.
\]

The sequence is uniquely determined except on a set of probability zero. ■

Proof: Analogous to Fuller (1976, pp. 29-31); replace the absolute value by
the Euclidean norm. ■


This  proposition  ensures  that  the  infinite  MA  representations  of  the  VAR 
processes  considered  in  this  text  are  well-defined  because  it  can  be  shown 
that  the  MA  coefficient  matrices  of  a  stable  VAR  process  form  an  absolutely 
summable  sequence.  With  respect  to  moments  of  infinite  sums,  we  have  the 
following  result. 

Proposition  C.10  ( Moments  of  Infinite  Sums  of  Random  Vectors) 

Suppose $z_t$ satisfies the conditions of Proposition C.9, $\{A_i\}$ and
$\{B_i\}$ are absolutely summable sequences of $(K \times K)$ matrices,

\[
y_t = \sum_{i=-\infty}^{\infty} A_i z_{t-i}
\quad\text{and}\quad
x_t = \sum_{i=-\infty}^{\infty} B_i z_{t-i}.
\]

Then

\[
E(y_t) = \lim_{n\to\infty} \sum_{i=-n}^{n} A_i E(z_{t-i})
\]

and

\[
E(y_t x_t') = \lim_{n\to\infty} \sum_{i=-n}^{n} \sum_{j=-n}^{n} A_i E(z_{t-i} z_{t-j}') B_j',
\]

where the limit of the sequence of matrices is the matrix of limits of the
sequences of individual elements. ■

Proof: Along similar lines as the proof of Fuller (1976, Theorem 2.2.2,
pp. 32-33). ■

While  we  have  restricted  the  discussion  to  absolutely  summable  sequences 
of  coefficients,  it  may  be  worth  mentioning  that  infinite  sums  of  random  vari¬ 
ables  and  vectors  can  be  defined  in  more  general  terms. 
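
The following sketch illustrates Propositions C.9 and C.10 for a stable
VAR(1) process $y_t = A y_{t-1} + u_t$, whose MA coefficient matrices are
$\Phi_i = A^i$. The matrices `A` and `Sigma_u` are hypothetical values chosen
for the example. With white noise $u_t$, the double sum in Proposition C.10
reduces to $\sum_i \Phi_i \Sigma_u \Phi_i'$, which is compared with the closed
form $\operatorname{vec}(\Gamma) = (I - A \otimes A)^{-1} \operatorname{vec}(\Sigma_u)$.

```python
import numpy as np

# Hypothetical stable VAR(1): y_t = A y_{t-1} + u_t, E(u_t u_t') = Sigma_u.
A = np.array([[0.5, 0.1],
              [0.4, 0.5]])
Sigma_u = np.array([[1.0, 0.3],
                    [0.3, 2.0]])
assert np.max(np.abs(np.linalg.eigvals(A))) < 1   # stability check

# MA coefficient matrices Phi_i = A^i: accumulate norms and covariance terms.
K = A.shape[0]
Phi = np.eye(K)
norm_sum = 0.0
Gamma_series = np.zeros((K, K))
for _ in range(500):
    norm_sum += np.linalg.norm(Phi, "fro")        # Euclidean norm of Phi_i
    Gamma_series += Phi @ Sigma_u @ Phi.T
    Phi = A @ Phi

# Closed form: vec(Gamma) = (I - A kron A)^{-1} vec(Sigma_u), column-major vec.
vec_gamma = np.linalg.solve(np.eye(K * K) - np.kron(A, A),
                            Sigma_u.reshape(-1, order="F"))
Gamma_closed = vec_gamma.reshape(K, K, order="F")

print(norm_sum, np.max(np.abs(Gamma_series - Gamma_closed)))
```

The sum of the norms stays finite, in line with absolute summability, and the
truncated series for the covariance matrix agrees with the closed form up to
rounding error.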


C.4  Laws  of  Large  Numbers  and  Central  Limit  Theorems 


The derivation of asymptotic properties of estimators and test statistics is
largely based on laws of large numbers (LLNs) and central limit theorems
(CLTs), some examples of which are listed in the following. So-called weak
LLNs  specify  conditions  under  which  a  sample  mean  converges  in  probability 
to  the  population  mean  and  strong  LLNs  state  the  corresponding  results  for 
almost  sure  convergence. 

In stating some of the results, martingale difference processes are useful
tools. Suppose $\{x_t\}$ ($t = 1, 2, \dots$) is a sequence of zero mean random
variables and let $\Omega_t$ be an information set available at time $t$ which
includes at least $\{x_1, \dots, x_t\}$ and possibly other random variables.
The sequence $\{x_t\}$ is said to be a martingale difference sequence with
respect to the sequence $\Omega_t$ if $E(x_t | \Omega_{t-1}) = 0$ for all
$t = 2, 3, \dots$. It is simply referred to as martingale difference sequence
if $E(x_t) = 0$ for $t = 1, 2, \dots$, and $E(x_t | x_{t-1}, \dots, x_1) = 0$
for $t = 2, 3, \dots$. More generally, a sequence $\{x_t\}$ of $K$-dimensional
vector random variables satisfying $E(x_t) = 0$ for all $t$ and
$E(x_t | x_{t-1}, \dots, x_1) = 0$ for $t = 2, 3, \dots$, is a vector
martingale difference sequence.

It is sometimes useful to allow the $x_t$'s to depend on the sample size. This
way a different sequence for each sample size $T$ is obtained. Denoting by
$x_{T,t}$ the $t$-th element of the $T$-th sequence, not just a sequence but an
array of random variables $\{x_{T,t}\}$ ($t = 1, 2, \dots, T$;
$T = 1, 2, \dots$) is obtained. Such an array is called a martingale difference
array if $E(x_{T,t}) = 0$ for all $t$ and $T$ and
$E(x_{T,t} | x_{T,t-1}, \dots, x_{T,1}) = 0$ for all $t$ and $T > 1$. This
definition also applies for vector arrays.
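
To make the distinction between martingale differences and independent
sequences concrete, the sketch below simulates an ARCH(1)-type recursion. The
recursion and its coefficient 0.5 are assumptions made for this example only.
Its conditional mean is zero by construction, so it is a martingale difference
sequence, but the squares are serially correlated, so the elements are not
independent.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
eps = rng.standard_normal(T)

# Assumed example: x_t = eps_t * sqrt(1 + 0.5 * x_{t-1}^2).
# E(x_t | x_{t-1}, ..., x_1) = 0 because eps_t is independent of the past.
x = np.zeros(T)
for t in range(1, T):
    x[t] = eps[t] * np.sqrt(1.0 + 0.5 * x[t - 1] ** 2)

def lag1_corr(v):
    """First-order sample autocorrelation."""
    v = v - v.mean()
    return np.sum(v[1:] * v[:-1]) / np.sum(v * v)

corr_levels = lag1_corr(x)         # close to zero: no linear dependence
corr_squares = lag1_corr(x ** 2)   # clearly positive: dependence in the squares

print(x.mean(), corr_levels, corr_squares)
```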

The  following  inequality  is  a  useful  device  for  deriving  asymptotic  results. 
It  is  therefore  presented  here  (see,  e.g.,  Fuller  (1976,  Theorem  5.1.1)). 

Proposition C.11 (Chebyshev's Inequality)

Given $r \in \mathbb{N}$, $r > 0$, let $x$ be a random variable such that
$E(|x|^r)$ exists. Then, for any $c \in \mathbb{R}$ and $\epsilon > 0$,

\[
\Pr\{|x - c| \ge \epsilon\} \le \frac{E(|x - c|^r)}{\epsilon^r}.
\]
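
A quick empirical check of Chebyshev's inequality can be sketched as follows.
The standard normal sample, $c = 0$ and $r = 2$ are illustrative assumptions.
Both sides are computed from the same sample, so the inequality holds for the
empirical distribution as well.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)   # illustrative sample, x ~ N(0, 1)
c, r = 0.0, 2

def chebyshev_check(eps):
    """Return the empirical tail probability and the moment bound."""
    lhs = np.mean(np.abs(x - c) >= eps)             # Pr{|x - c| >= eps}
    rhs = np.mean(np.abs(x - c) ** r) / eps ** r    # E|x - c|^r / eps^r
    return lhs, rhs

results = {eps: chebyshev_check(eps) for eps in (0.5, 1.0, 2.0, 3.0)}
for eps, (lhs, rhs) in results.items():
    print(eps, lhs, rhs)
```

For small $\epsilon$ the bound exceeds one and is uninformative. It becomes
useful in the tails.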


The next proposition collects some weak LLNs (see, e.g., Davidson (1994,
Part IV)).

Proposition C.12 (Weak Laws of Large Numbers)

(1) (Khinchine's Theorem) (Rao (1973, p. 112))

Let $\{x_t\}$ be a sequence of i.i.d. random variables with
$E(x_t) = \mu < \infty$. Then

\[
\bar{x}_T := \frac{1}{T} \sum_{t=1}^{T} x_t \xrightarrow{p} \mu.
\]

(2) Let $\{x_t\}$ be a sequence of independent random variables with
$E(x_t) = \mu < \infty$ and $E|x_t|^{1+\epsilon} \le c < \infty$
($t = 1, 2, \dots$) for some $\epsilon > 0$ and a finite constant $c$. Then
$\bar{x}_T \xrightarrow{p} \mu$.

(3) (Chebyshev's Theorem) (Rao (1973, p. 112))

Let $\{x_t\}$ be a sequence of uncorrelated random variables with
$E(x_t) = \mu < \infty$ and $\lim_{T\to\infty} E(\bar{x}_T - \mu)^2 = 0$. Then
$\bar{x}_T \xrightarrow{p} \mu$.

(4) (Corollary to Chebyshev's Theorem)

Let $\{x_t\}$ be a sequence of independent random variables with
$E(x_t) = \mu < \infty$ and $\mathrm{Var}(x_t) \le c < \infty$
($t = 1, 2, \dots$) for some finite constant $c$. Then
$\bar{x}_T \xrightarrow{p} \mu$.

(5) (LLN for Martingale Differences)

Let $\{x_t\}$ be a strictly stationary martingale difference sequence with
$E|x_t| < \infty$ ($t = 1, 2, \dots$). Then $\bar{x}_T \xrightarrow{p} 0$.

(6) (LLN for Martingale Difference Arrays)

Let $\{x_{T,t}\}$ be a martingale difference array with
$E|x_{T,t}|^{1+\epsilon} \le c < \infty$ for all $t$ and $T$ for some
$\epsilon > 0$ and a finite constant $c$. Then
$\bar{x}_T := T^{-1} \sum_{t=1}^{T} x_{T,t} \xrightarrow{p} 0$.

(7) (Stationary Processes) (Hamilton (1994, Proposition 7.5))

Let $\{x_t\}$ be a stationary stochastic process with $E(x_t) = \mu < \infty$
and $E[(x_t - \mu)(x_{t-j} - \mu)] = \gamma_j$ ($t = 1, 2, \dots$) such that
$\sum_{j=0}^{\infty} |\gamma_j| < \infty$. Then
$\bar{x}_T \xrightarrow{q.m.} \mu$ and, hence, $\bar{x}_T \xrightarrow{p} \mu$,
and $\lim_{T\to\infty} T E(\bar{x}_T - \mu)^2 = \sum_{j=-\infty}^{\infty} \gamma_j$.

Notice that the i.i.d. assumption in Khinchine's theorem may be replaced
by the requirement that moments exist of order larger than one. In fact,
Chebyshev's theorem even requires the existence of second order moments.
It is actually sufficient that the variances of the $x_t$ are bounded. It may
be worth noting that heterogeneous variances are allowed for the weak LLN to
hold, if the variances are bounded. The last result in the proposition shows
that uncorrelated elements of the sequence under consideration are not
required. Actually, a martingale difference sequence does not necessarily have
independent elements so that for most of the above results independence of
the sequence elements is not assumed.
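
The role of bounded but heterogeneous variances can be illustrated with a
short simulation. The mean 1.5 and the three standard deviations below are
arbitrary values assumed for this sketch. The draws are independent with a
common mean, so the corollary to Chebyshev's theorem applies.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 1.5
T_max = 100_000

# Independent draws with mean mu and standard deviations cycling through
# 0.5, 1 and 2 (heterogeneous but bounded variances, illustrative values).
sd = np.array([0.5, 1.0, 2.0])[np.arange(T_max) % 3]
x = mu + sd * rng.standard_normal(T_max)

# Sample means for T = 1, ..., T_max.
running_mean = np.cumsum(x) / np.arange(1, T_max + 1)

for T in (10, 1_000, 100_000):
    print(T, running_mean[T - 1])
```

The sample mean settles near `mu` as `T` grows, even though the variances
differ across observations.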

Notice that the proposition generalizes straightforwardly to sequences of
random vectors because convergence in probability for a sequence of random
vectors is defined in terms of convergence of the sequences of the individual
elements. The following CLTs are stated for vector sequences and, of course,
hold for univariate sequences as special cases.

Proposition  C.13  ( Central  Limit  Theorems) 

(1) (Lindeberg-Levy CLT)

Let $\{x_t\}$ be a sequence of $K$-dimensional i.i.d. random vectors with mean
$\mu$ and covariance matrix $\Sigma_x$. Then

\[
\sqrt{T}(\bar{x}_T - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma_x).
\]

(2) (CLT for Martingale Difference Arrays) (see Hamilton (1994, Proposition
7.9))

Let $\{x_{T,t} = (x_{1T,t}, \dots, x_{KT,t})'\}$ be a $K$-dimensional
martingale difference array with covariance matrices
$E(x_{T,t} x_{T,t}') = \Sigma_{Tt}$ such that
$T^{-1} \sum_{t=1}^{T} \Sigma_{Tt} \to \Sigma$, where $\Sigma$ is positive
definite. Moreover, suppose that
$T^{-1} \sum_{t=1}^{T} x_{T,t} x_{T,t}' \xrightarrow{p} \Sigma$ and
$E(x_{iT,t} x_{jT,t} x_{kT,t} x_{lT,t}) < \infty$ for all $t$ and $T$ and all
$1 \le i, j, k, l \le K$. Then

\[
\sqrt{T} \bar{x}_T \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]

(3) (CLT for Stationary Processes)

Let $x_t = \mu + \sum_{j=0}^{\infty} \Phi_j u_{t-j}$ be a $K$-dimensional
stationary stochastic process with $E(x_t) = \mu < \infty$,
$\sum_{j=0}^{\infty} \|\Phi_j\| < \infty$ and $u_t \sim (0, \Sigma_u)$ i.i.d.
white noise. Then

\[
\sqrt{T}(\bar{x}_T - \mu) \xrightarrow{d}
\mathcal{N}\Big(0, \sum_{j=-\infty}^{\infty} \Gamma_x(j)\Big),
\]

where $\Gamma_x(j) := E[(x_t - \mu)(x_{t-j} - \mu)']$.

The  results  in  Proposition  C.13  are  just  examples  of  useful  CLTs.  A  variety 
of  similar  results  exists  for  different  sets  of  conditions.  More  discussion  of 
CLTs  and  proofs  can  be  found  in  Davidson  (1994,  Part  V).  For  the  CLT  for 
stationary  processes  see  Anderson  (1971,  Chapters  7  and  8). 
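
As a hedged illustration of the CLT for stationary processes, consider a
univariate AR(1) process $x_t = \alpha x_{t-1} + u_t$, so that
$\Phi_j = \alpha^j$ and $\gamma_j = \gamma_0 \alpha^{|j|}$. The parameter
values, the burn-in length and the number of replications are assumptions of
this sketch. The sum of the autocovariances should equal
$\sigma_u^2/(1-\alpha)^2$, and the simulated variance of
$\sqrt{T}\,\bar{x}_T$ should be close to it.

```python
import numpy as np

alpha, sigma2_u = 0.5, 1.0          # illustrative AR(1): x_t = alpha x_{t-1} + u_t
gamma0 = sigma2_u / (1 - alpha**2)

# Long-run variance: sum over j of gamma_j = gamma_0 * alpha^{|j|}.
j = np.arange(-200, 201)
long_run_var = np.sum(gamma0 * alpha ** np.abs(j))   # equals sigma2_u / (1 - alpha)^2

rng = np.random.default_rng(3)
T, reps, burn = 1_000, 2_000, 100
u = rng.standard_normal((reps, T + burn))
x = np.zeros_like(u)
for t in range(1, T + burn):
    x[:, t] = alpha * x[:, t - 1] + u[:, t]

# sqrt(T) * (xbar_T - mu) with mu = 0, one value per replication.
scaled_means = np.sqrt(T) * x[:, burn:].mean(axis=1)
sim_var = scaled_means.var()

print(long_run_var, sim_var)
```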

To  derive  the  asymptotic  distribution  of  a  vector  sequence  it  is  actually 
sufficient  to  consider  univariate  series.  This  is  a  consequence  of  the  following 
result. 

Proposition C.14 (Cramér-Wold Device) (Rao (1973, p. 123))

Let $x_T$ be a $K$-dimensional sequence of random vectors and $x$ a
$K$-dimensional random vector. If $c' x_T \xrightarrow{d} c' x$ for all
$c \in \mathbb{R}^K$, then $x_T \xrightarrow{d} x$.

Therefore, to show asymptotic normality of a sequence,
$\sqrt{T}(\hat\beta_T - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, it
suffices to show for all $K$-vectors $c$ with $c' \Sigma c \ne 0$,

\[
\frac{\sqrt{T}\, c'(\hat\beta_T - \beta)}{(c' \Sigma c)^{1/2}}
\xrightarrow{d} \mathcal{N}(0, 1).
\]


Hence,  CLTs  for  univariate  series  can  in  fact  be  used  to  show  multivariate 
results. 


C.5  Standard  Asymptotic  Properties  of  Estimators  and 
Test  Statistics 


Suppose we have a sequence of $(m \times n)$ estimators $\{\hat{B}_T\}$ for an
$(m \times n)$ parameter matrix $B$, where $T$ denotes the sample sizes (time
series lengths) on which the estimators are based. For simplicity we will
delete the subscript $T$ in the following and we will mean the sequence of
estimators when we use the term "estimator".

The estimator $\hat{B}$ is consistent if $\operatorname{plim} \hat{B} = B$. In
the related literature, this type of consistency is sometimes called weak
consistency. However, in this text, we simply use the term consistency
instead. The estimator is strongly consistent if
$\hat{B} \xrightarrow{a.s.} B$, and the estimator is mean square consistent if
$\hat{B} \xrightarrow{q.m.} B$. By Proposition C.1, both strong consistency and
mean square consistency imply consistency.

Let $\hat\beta$ be an estimator (a sequence of estimators) of a $(K \times 1)$
vector $\beta$. The estimator is said to have an asymptotic normal distribution
if $\sqrt{T}(\hat\beta - \beta)$ converges in distribution to a random vector
with multivariate normal distribution $\mathcal{N}(0, \Sigma)$, that is,

\[
\sqrt{T}(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma). \tag{C.5.1}
\]

In that case, for large $T$, $\mathcal{N}(\beta, \Sigma/T)$ is usually used as
an approximation to the distribution of $\hat\beta$. Equivalently, by the
Cramér-Wold device (Proposition C.14), (C.5.1) may be defined by requiring
that

\[
\frac{\sqrt{T}\, c'(\hat\beta - \beta)}{(c' \Sigma c)^{1/2}}
\xrightarrow{d} \mathcal{N}(0, 1),
\]

for any $(K \times 1)$ vector $c$ for which $c' \Sigma c \ne 0$. The following
proposition provides some useful rules for determining the asymptotic
distributions of estimators and test statistics.


Proposition  C.15  ( Asymptotic  Properties  of  Estimators) 

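Suppose $\hat\beta$ is an estimator of the $(K \times 1)$ vector $\beta$ with
$\sqrt{T}(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$.

Then the following rules hold:

(1) If $\operatorname{plim} \hat{A} = A$, then
$\sqrt{T} \hat{A}(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, A \Sigma A')$
(see Schmidt (1976, p. 251)).

(2) If $R \ne 0$ is an $(M \times K)$ matrix, then
$\sqrt{T}(R\hat\beta - R\beta) \xrightarrow{d} \mathcal{N}(0, R \Sigma R')$.

(3) (Delta method)

If $g(\beta) = (g_1(\beta), \dots, g_m(\beta))'$ is a vector-valued
continuously differentiable function with $\partial g/\partial\beta' \ne 0$ at
$\beta$, then

\[
\sqrt{T}[g(\hat\beta) - g(\beta)] \xrightarrow{d}
\mathcal{N}\Big(0, \frac{\partial g(\beta)}{\partial\beta'}\, \Sigma\,
\frac{\partial g(\beta)'}{\partial\beta}\Big).
\]

If $\partial g/\partial\beta' = 0$ at $\beta$,
$\sqrt{T}[g(\hat\beta) - g(\beta)] \xrightarrow{p} 0$ (see Serfling (1980,
pp. 122-124)).

(4) If $\Sigma$ is nonsingular,
$T(\hat\beta - \beta)' \Sigma^{-1} (\hat\beta - \beta) \xrightarrow{d} \chi^2(K)$.

(5) If $\Sigma$ is nonsingular and $\operatorname{plim} \hat\Sigma = \Sigma$,
then
$T(\hat\beta - \beta)' \hat\Sigma^{-1} (\hat\beta - \beta) \xrightarrow{d} \chi^2(K)$.

(6) If $\Sigma = QA$, where $Q$ is symmetric, idempotent of rank $n$ and $A$ is
positive definite, then
$T(\hat\beta - \beta)' A^{-1} (\hat\beta - \beta) \xrightarrow{d} \chi^2(n)$.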
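
The delta method in rule (3) can be sketched numerically as follows. The
function `g`, the parameter value `beta` and the covariance matrix `Sigma` are
hypothetical inputs. The Jacobian $\partial g/\partial\beta'$ is approximated
by central differences, which is a choice made for this example and not
something prescribed by the proposition.

```python
import numpy as np

def delta_method_cov(g, beta, Sigma, h=1e-6):
    """Asymptotic covariance J Sigma J' with J = dg/dbeta' by central differences."""
    beta = np.asarray(beta, dtype=float)
    g0 = np.atleast_1d(g(beta))
    J = np.zeros((g0.size, beta.size))
    for k in range(beta.size):
        step = np.zeros_like(beta)
        step[k] = h
        J[:, k] = (np.atleast_1d(g(beta + step)) - np.atleast_1d(g(beta - step))) / (2 * h)
    return J @ Sigma @ J.T

beta = np.array([2.0, 3.0])             # illustrative parameter value
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])           # illustrative asymptotic covariance

def g(b):
    """Example transformation: product and ratio of the two parameters."""
    return np.array([b[0] * b[1], b[0] / b[1]])

cov_g = delta_method_cov(g, beta, Sigma)
print(cov_g)
```

For the product $\beta_1\beta_2$ the Jacobian row is $(\beta_2, \beta_1) = (3, 2)$,
so the first diagonal element is $9 \cdot 1 + 2 \cdot 6 \cdot 0.2 + 4 \cdot 0.5 = 13.4$.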


C.6  Maximum  Likelihood  Estimation 

Suppose $y_1, y_2, \dots$ is a sequence of $K$-dimensional random vectors, the
first $T$ of which have a joint probability density function
$f_T(y_1, \dots, y_T; \delta_0)$, where $\delta_0$ is an unknown
$(M \times 1)$ vector of parameters that does not depend on $T$. It is assumed
to be from a subset $D$ of the $M$-dimensional Euclidean space $\mathbb{R}^M$.
Suppose further that $f_T(\cdot; \delta)$ has a known functional form and one
wishes to estimate $\delta_0$.

For a fixed realization $y_1, \dots, y_T$, the function

\[
l(\delta) = l(\delta | y_1, \dots, y_T) = f_T(y_1, \dots, y_T; \delta),
\]

viewed as a function of $\delta$, is the likelihood function. Its natural
logarithm $\ln l(\delta | \cdot)$ is the log-likelihood function. A vector
$\tilde\delta$, maximizing the likelihood function or log-likelihood function,
is called a maximum likelihood (ML) estimate, that is, if

\[
l(\tilde\delta) = \sup_{\delta \in D} l(\delta),
\]

then $\tilde\delta$ is an ML estimate. Here sup denotes the supremum, that is,
the least upper bound, which may exist even if the maximum does not. In
general, $\tilde\delta$ depends on $y_1, \dots, y_T$, that is,
$\tilde\delta = \tilde\delta(y_1, \dots, y_T)$. Replacing the fixed values
$y_1, \dots, y_T$ by their corresponding random vectors, $\tilde\delta$ is an
ML estimator of $\delta_0$ if the functional dependence on $y_1, \dots, y_T$ is
such that $\tilde\delta$ is a random vector.

If $l(\delta)$ is a differentiable function of $\delta$, the vector of first
order partial derivatives of $\ln l(\delta)$, that is,

\[
s(\delta) = \partial \ln l(\delta)/\partial\delta,
\]

regarded as a random vector (a function of the random vectors
$y_1, \dots, y_T$), is the score vector. It vanishes at
$\delta = \tilde\delta$ if the maximum of $\ln l(\delta)$ is attained at an
interior point of the parameter space $D$. The information matrix for
$\delta_0$ is minus the expectation of the matrix of second order partial
derivatives of $\ln l$, evaluated at the true parameter vector $\delta_0$,

\[
\mathcal{I}(\delta_0) = -E\left[ \frac{\partial^2 \ln l}{\partial\delta\,\partial\delta'}
\bigg|_{\delta_0} \right].
\]

The matrix

\[
\mathcal{I}_a(\delta_0) = \lim_{T\to\infty} \mathcal{I}(\delta_0)/T,
\]

if it exists, is the asymptotic information matrix for $\delta_0$. If it is
nonsingular, its inverse is a lower bound for the covariance matrix of the
asymptotic distribution of any consistent estimator with asymptotic normal
distribution. In other words, if $\hat\delta$ is a consistent estimator of
$\delta_0$ with

\[
\sqrt{T}(\hat\delta - \delta_0) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\hat\delta}),
\]

then $\mathcal{I}_a(\delta_0)^{-1} \le \Sigma_{\hat\delta}$, that is,
$\Sigma_{\hat\delta} - \mathcal{I}_a(\delta_0)^{-1}$ is positive semidefinite.
Under quite general regularity conditions, an ML estimator $\tilde\delta$ for
$\delta_0$ is consistent and

\[
\sqrt{T}(\tilde\delta - \delta_0) \xrightarrow{d}
\mathcal{N}(0, \mathcal{I}_a(\delta_0)^{-1}).
\]

Thus, in large samples, $\tilde\delta$ is approximately distributed as
$\mathcal{N}(\delta_0, \mathcal{I}_a(\delta_0)^{-1}/T)$.
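
A minimal sketch of these concepts for a simple case: an i.i.d. sample from
$\mathcal{N}(\mu, \sigma^2)$ with parameter vector $\delta = (\mu, \sigma^2)'$.
The model and the simulated data are assumptions made for this illustration.
For this model the ML estimates are available in closed form, the score
vanishes at the ML estimate, and the asymptotic information matrix is
$\operatorname{diag}(1/\sigma^2, 1/(2\sigma^4))$.

```python
import numpy as np

rng = np.random.default_rng(4)
y = 1.0 + 2.0 * rng.standard_normal(500)   # illustrative sample from N(1, 4)
T = y.size

def log_lik(delta):
    """Gaussian log-likelihood for delta = (mu, sigma^2)."""
    mu, s2 = delta
    return -0.5 * T * np.log(2 * np.pi * s2) - 0.5 * np.sum((y - mu) ** 2) / s2

def score(delta):
    """Vector of first order partial derivatives of the log-likelihood."""
    mu, s2 = delta
    return np.array([np.sum(y - mu) / s2,
                     -0.5 * T / s2 + 0.5 * np.sum((y - mu) ** 2) / s2 ** 2])

def asy_information(delta):
    """Asymptotic information matrix diag(1/s2, 1/(2 s2^2)) for this model."""
    _, s2 = delta
    return np.diag([1.0 / s2, 0.5 / s2 ** 2])

# Closed-form ML estimates for this model.
delta_ml = np.array([y.mean(), np.mean((y - y.mean()) ** 2)])

# Large-sample covariance approximation I_a^{-1} / T, evaluated at the estimate.
approx_cov = np.linalg.inv(asy_information(delta_ml)) / T

print(delta_ml, score(delta_ml), np.sqrt(np.diag(approx_cov)))
```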


C.7  Likelihood  Ratio,  Lagrange  Multiplier,  and  Wald 
Tests 

Three  principles  for  constructing  tests  of  statistical  hypotheses  are  employed 
frequently  in  the  text.  We  consider  testing  of 

\[
H_0 : \varphi(\delta_0) = 0 \quad\text{against}\quad H_1 : \varphi(\delta_0) \ne 0, \tag{C.7.1}
\]

where $\delta_0$ is the true $(M \times 1)$ parameter vector, as in the
previous section, and $\varphi : \mathbb{R}^M \to \mathbb{R}^N$ is a
continuously differentiable function so that $\varphi(\delta)$ is of dimension
$(N \times 1)$. We assume that
$[\partial\varphi/\partial\delta' |_{\delta_0}]$ has rank $N$. This condition
implies that $N \le M$ and the $N$ restrictions for the parameter vector are
distinguishable in a neighborhood of $\delta_0$. Often the hypotheses can be
written alternatively as

\[
H_0 : \delta_0 = g(\gamma_0) \quad\text{against}\quad H_1 : \delta_0 \ne g(\gamma_0), \tag{C.7.2}
\]

where $\gamma_0$ is an $(M - N)$-dimensional vector and
$g : \mathbb{R}^{M-N} \to \mathbb{R}^M$ is a continuously differentiable
function in a neighborhood of $\gamma_0$ (see Gallant (1987, pp. 57-58)).

The likelihood ratio (LR) test of (C.7.1) or (C.7.2) is based on the statistic

\[
\lambda_{LR} = 2[\ln l(\tilde\delta) - \ln l(\tilde\delta_r)],
\]

where $\tilde\delta$ denotes the unconstrained ML estimator and
$\tilde\delta_r$ is the restricted ML estimator of $\delta_0$, subject to the
restrictions specified under $H_0$, that is, $\tilde\delta_r$ is obtained by
maximizing $\ln l$ over the parameter space restricted by the conditions
stated in $H_0$. Under suitable regularity conditions, we have

\[
\lambda_{LR} \xrightarrow{d} \chi^2(N). \tag{C.7.3}
\]

The Lagrange multiplier (LM) statistic for testing (C.7.1) or (C.7.2) is of
the form

\[
\lambda_{LM} = s(\tilde\delta_r)' \mathcal{I}(\tilde\delta_r)^{-1} s(\tilde\delta_r), \tag{C.7.4}
\]

where $s(\delta)$ denotes the score vector and $\mathcal{I}(\delta)$ the
information matrix, as before. In the LM statistic, both functions are
evaluated at the restricted estimator of $\delta_0$. Under $H_0$,
$\lambda_{LM}$ has an asymptotic $\chi^2(N)$-distribution, if weak regularity
conditions are satisfied. The name derives from the fact that it can be
written as

\[
\lambda_{LM} = \lambda' \left[ \frac{\partial\varphi}{\partial\delta'} \bigg|_{\tilde\delta_r} \right]
\mathcal{I}(\tilde\delta_r)^{-1}
\left[ \frac{\partial\varphi'}{\partial\delta} \bigg|_{\tilde\delta_r} \right] \lambda, \tag{C.7.5}
\]

where $\lambda$ is the vector of Lagrange multipliers for which the Lagrange
function has a stationary point corresponding to the constrained estimator
(see Appendix A.14).

The equivalence of (C.7.4) and (C.7.5) can be seen by recalling that the
constrained minimum of $-\ln l$ is attained at a stationary point of the
Lagrange function

\[
\mathcal{L}(\delta, \lambda) = -\ln l(\delta) + \lambda' \varphi(\delta).
\]

In other words, $\tilde\delta_r$ satisfies

\[
-s(\tilde\delta_r)' + \lambda' \left[ \frac{\partial\varphi}{\partial\delta'} \bigg|_{\tilde\delta_r} \right] = 0.
\]

The LM statistic is often computed via an auxiliary regression. To see how
this can be done, consider a normal regression model of the form

\[
y = X\beta + Z\gamma + u,
\]

where $y$ and $u$ are $(T \times 1)$ vectors, $X$ and $Z$ are $(T \times M)$
and $(T \times N)$ regressor matrices, respectively, $\beta$ and $\gamma$ are
$(M \times 1)$ and $(N \times 1)$ parameter vectors and
$u \sim \mathcal{N}(0, \sigma_u^2 I_T)$. Suppose we wish to test the pair of
hypotheses

\[
H_0 : \gamma = 0 \quad\text{versus}\quad H_1 : \gamma \ne 0.
\]


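In this case, the score vector is

\[
s(\beta, \gamma) = \frac{1}{\sigma_u^2}
\begin{bmatrix} X' \\ Z' \end{bmatrix} (y - X\beta - Z\gamma),
\]

the inverse information matrix is

\[
\sigma_u^2 \begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix}^{-1}
\]

and the restricted estimator is

\[
\begin{bmatrix} \tilde\beta_r \\ 0 \end{bmatrix}
= \begin{bmatrix} (X'X)^{-1} X'y \\ 0 \end{bmatrix}.
\]

Notice that the first order conditions for computing this estimator imply

\[
X'(y - X\tilde\beta_r) = X'\hat{u} = 0.
\]

Here $\hat{u} := y - X\tilde\beta_r$ is the residual vector of the restricted
estimation. Hence, the score vector evaluated at the restricted estimator is

\[
s(\tilde\beta_r, 0) = \frac{1}{\sigma_u^2}
\begin{bmatrix} X'(y - X\tilde\beta_r) \\ Z'(y - X\tilde\beta_r) \end{bmatrix}
= \frac{1}{\sigma_u^2} \begin{bmatrix} 0 \\ Z'\hat{u} \end{bmatrix}
\]

and the LM statistic becomes

\[
\lambda_{LM} = [0 : \hat{u}'Z]
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ Z'\hat{u} \end{bmatrix} \Big/ \sigma_u^2
= \hat{u}'Z (Z'Z - Z'X(X'X)^{-1}X'Z)^{-1} Z'\hat{u} / \sigma_u^2,
\]

where the rules for the partitioned inverse have been used (see Appendix
A.10).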

The same statistic is obtained by using the usual $\chi^2$-statistic for
testing $\gamma = 0$ in the auxiliary regression model

\[
\hat{u} = X\beta + Z\gamma + e,
\]

where $e$ is an error vector. The LS estimator from this model is

\[
\begin{bmatrix} \hat\beta \\ \hat\gamma \end{bmatrix}
= \begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix}^{-1}
\begin{bmatrix} X'\hat{u} \\ Z'\hat{u} \end{bmatrix}.
\]

Using $X'\hat{u} = 0$ and the rules for the partitioned inverse gives

\[
\hat\gamma = (Z'Z - Z'X(X'X)^{-1}X'Z)^{-1} Z'\hat{u}
\sim \mathcal{N}\big(\gamma, \sigma_u^2 (Z'Z - Z'X(X'X)^{-1}X'Z)^{-1}\big).
\]

Hence, the $\chi^2$-statistic

\[
\hat\gamma' (Z'Z - Z'X(X'X)^{-1}X'Z) \hat\gamma / \sigma_u^2
\]

is easily seen to be identical to the previously obtained expression for
$\lambda_{LM}$. Of course, algebraically the same result is obtained if
$\sigma_u^2$ is replaced by an estimator. Using the usual modifications, the
statistic has an $F$-distribution in this case. More precisely,

\[
\frac{\hat\gamma' (Z'Z - Z'X(X'X)^{-1}X'Z) \hat\gamma}{N \hat\sigma_u^2}
\sim F(N, T - M - N).
\]


Although  we  have  used  a  normal  regression  model  with  nonstochastic  re¬ 
gressors  in  this  illustration,  a  similar  reasoning  often  applies  for  more  general 
situations  and  it  implies  an  auxiliary  regression  model  from  which  the  LM 
statistic  can  be  obtained.  The  reason  is  that  much  of  the  derivation  rests  on 
the  algebraic  properties  of  the  quantities  involved.  Therefore,  similar  argu¬ 
ments  can  be  used,  for  example,  if  the  regressors  are  stochastic  or  a  GLS 
estimation  is  used.  In  Chapters  4  and  5,  the  LM  statistics  for  residual  auto¬ 
correlation  in  VAR  models  are,  for  instance,  derived  in  this  way. 
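
The algebraic equivalence described above can be checked numerically. The
sketch below uses simulated data with hypothetical dimensions and treats
$\sigma_u^2$ as known. It computes the LM statistic once from the score and
the inverse information matrix and once from the coefficient on $Z$ in the
auxiliary regression of the restricted residuals on $X$ and $Z$.

```python
import numpy as np

rng = np.random.default_rng(5)
T, M, N = 200, 2, 3                      # hypothetical dimensions
X = np.column_stack([np.ones(T), rng.standard_normal((T, M - 1))])
Z = rng.standard_normal((T, N))
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(T)   # generated under H0
sigma2_u = 1.0                           # treated as known in this sketch

# Restricted estimation (gamma = 0) and its residual vector.
beta_r = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_r

# LM statistic from the score and the inverse information matrix.
W = np.column_stack([X, Z])
score = np.concatenate([X.T @ u_hat, Z.T @ u_hat]) / sigma2_u
inv_info = sigma2_u * np.linalg.inv(W.T @ W)
lm_direct = score @ inv_info @ score

# Same statistic from the auxiliary regression of u_hat on X and Z.
coef = np.linalg.solve(W.T @ W, W.T @ u_hat)
gamma_hat = coef[M:]
S = Z.T @ Z - Z.T @ X @ np.linalg.solve(X.T @ X, X.T @ Z)
lm_aux = gamma_hat @ S @ gamma_hat / sigma2_u

print(lm_direct, lm_aux)
```

Both numbers agree up to rounding error because $X'\hat{u} = 0$ by the first
order conditions of the restricted estimation.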

The Wald statistic is based on an unconstrained estimator which is
asymptotically normal,

\[
\sqrt{T}(\tilde\delta - \delta_0) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\tilde\delta}).
\]

By Proposition C.15(3), it follows that

\[
\sqrt{T}[\varphi(\tilde\delta) - \varphi(\delta_0)] \xrightarrow{d}
\mathcal{N}\left(0, \frac{\partial\varphi}{\partial\delta'} \bigg|_{\delta_0}
\Sigma_{\tilde\delta}\, \frac{\partial\varphi'}{\partial\delta} \bigg|_{\delta_0}\right).
\]

Thus, by Proposition C.15(5), if $H_0 : \varphi(\delta_0) = 0$ is true and the
covariance matrix is invertible,

\[
\lambda_W = T \varphi(\tilde\delta)'
\left[ \frac{\partial\varphi}{\partial\delta'} \bigg|_{\tilde\delta}
\hat\Sigma_{\tilde\delta}\,
\frac{\partial\varphi'}{\partial\delta} \bigg|_{\tilde\delta} \right]^{-1}
\varphi(\tilde\delta) \xrightarrow{d} \chi^2(N), \tag{C.7.6}
\]

where $\hat\Sigma_{\tilde\delta}$ is a consistent estimator of
$\Sigma_{\tilde\delta}$. The statistic $\lambda_W$ is the Wald statistic. For
further discussion of the three test statistics and proofs of their asymptotic
distributions see also Hayashi (2000, Chapter 7).

In summary, we have three test statistics with equivalent asymptotic
distributions under the null hypothesis. The LR statistic involves both the
restricted and the unrestricted ML estimators, the LM statistic is based on
the restricted estimator only, and the Wald statistic requires just the
unrestricted estimator. The choice among the three statistics is often based
on computational convenience. Wald tests have the disadvantage that they are
not invariant under transformations of the restrictions. In other words, if
the restrictions can be written in two equivalent ways (e.g., $\delta_i = 0$
and $\delta_i^2 = 0$) the corresponding Wald tests may have different small
sample properties. Their small sample power may be low (see Gregory & Veall
(1985), Breusch & Schmidt (1988)).
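
The lack of invariance of Wald tests can be illustrated with a small
calculation. The estimate `delta_hat`, the covariance estimate `Sigma_hat` and
the sample size are hypothetical numbers, and the pair of restrictions
($\delta_1\delta_2 = 1$ versus $\delta_1 = 1/\delta_2$) is chosen here only as
an example of two algebraically equivalent formulations.

```python
import numpy as np

def wald(phi, jac, delta_hat, Sigma_hat, T):
    """T * phi' [J Sigma J']^{-1} phi, with J the Jacobian of phi at delta_hat."""
    p = np.atleast_1d(phi(delta_hat))
    J = np.atleast_2d(jac(delta_hat))
    return float(T * p @ np.linalg.solve(J @ Sigma_hat @ J.T, p))

# Hypothetical estimation output (not taken from any data set).
delta_hat = np.array([0.8, 1.5])
Sigma_hat = np.array([[0.4, 0.1],
                      [0.1, 0.9]])
T = 50

# Restriction written as delta_1 * delta_2 - 1 = 0.
w_product = wald(lambda d: d[0] * d[1] - 1.0,
                 lambda d: np.array([d[1], d[0]]),
                 delta_hat, Sigma_hat, T)

# Same restriction written as delta_1 - 1 / delta_2 = 0.
w_ratio = wald(lambda d: d[0] - 1.0 / d[1],
               lambda d: np.array([1.0, 1.0 / d[1] ** 2]),
               delta_hat, Sigma_hat, T)

print(w_product, w_ratio)
```

The two statistics differ although they test the same hypothesis, so in finite
samples the test decision can depend on how the restriction is written.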


C.8  Unit  Root  Asymptotics 

C.8.1  Univariate  Processes 


In deriving asymptotic results for processes with unit roots, it is helpful to
consider also continuous stochastic processes. An important example is a
standard Brownian motion or standard Wiener process $W(\cdot)$, which is
defined on the unit interval $[0, 1]$ and assigns a random variable $W(t)$ to
each $t \in [0, 1]$ such that the following conditions hold:

(1)  W(0)  =  0  with  probability  one. 

(2)  W(t)  is  continuous  in  t  with  probability  one. 

(3) For any partitioning of the unit interval,
$0 \le t_1 < t_2 < \dots < t_k \le 1$, the vector

\[
\begin{bmatrix} W(t_2) - W(t_1) \\ \vdots \\ W(t_k) - W(t_{k-1}) \end{bmatrix}
\sim \mathcal{N}\left(0,
\begin{bmatrix} t_2 - t_1 & & 0 \\ & \ddots & \\ 0 & & t_k - t_{k-1} \end{bmatrix}
\right),
\]

that is, the differences have a multivariate normal distribution with
independent components, means of zero, and variances $t_i - t_{i-1}$.

Wiener processes play an important role in the asymptotic theory for unit root
processes. Nonstandard versions of the type $Z(t) = \sigma W(t)$ are often
encountered. Their increments are still independent but
$Z(t) - Z(s) \sim \mathcal{N}(0, \sigma^2 (t - s))$ for $s < t$. Notice also
that $Z(t) \sim \mathcal{N}(0, \sigma^2 t)$.
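
A discretized approximation of a standard Wiener process can be sketched as
follows. The grid size and the number of simulated paths are assumptions of
this example. Increments over disjoint intervals are built from independent
$\mathcal{N}(0, \Delta t)$ draws, so their sample variances should match the
interval lengths and their sample correlation should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n_steps, n_paths = 200, 5_000            # grid size and replications (illustrative)
dt = 1.0 / n_steps

# W(0) = 0; increments over disjoint grid intervals are independent N(0, dt).
increments = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)

def at(t):
    """Values of all simulated paths at the grid point closest to time t."""
    return W[:, int(round(t * n_steps))]

d1 = at(0.5) - at(0.2)     # variance should be near 0.3
d2 = at(1.0) - at(0.5)     # variance should be near 0.5
var_d1, var_d2 = d1.var(), d2.var()
corr_d1_d2 = np.corrcoef(d1, d2)[0, 1]

print(var_d1, var_d2, corr_d1_d2)
```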

In developing unit root asymptotics, we are often interested in quantities
of the form

\[
X_T(r) = \frac{1}{T} \sum_{t=1}^{[Tr]} w_t,
\]

where $w_t$ is a stationary stochastic process, $r \in [0, 1]$ denotes a
fraction and $[Tr]$ signifies the largest integer less than or equal to $Tr$.
If the $w_t = u_t$ are i.i.d. $(0, \sigma_u^2)$, we know from a central limit
theorem (see Proposition C.13) that

\[
\sqrt{T} X_T(r) = \frac{\sqrt{[Tr]}}{\sqrt{T}} \frac{1}{\sqrt{[Tr]}}
\sum_{t=1}^{[Tr]} u_t \xrightarrow{d} \mathcal{N}(0, r \sigma_u^2)
\]

for every $r \in [0, 1]$, because $\sqrt{[Tr]}/\sqrt{T} \to \sqrt{r}$.
Moreover,

\[
\sqrt{T}[X_T(r_2) - X_T(r_1)]/\sigma_u \xrightarrow{d} \mathcal{N}(0, r_2 - r_1)
\]

for $r_1 < r_2$. For nonoverlapping partitions of the unit interval, the
partial sums will be made up of independent terms and they are therefore
independent. Hence, it is plausible to write

\[
\sqrt{T} X_T(\cdot)/\sigma_u \xrightarrow{d} W(\cdot). \tag{C.8.1}
\]

This  notation  and  result  generalizes  the  previously  defined  concept  of  con¬ 
vergence  in  distribution  because  now  convergence  is  stated  for  a  sequence 
of  continuous  time  stochastic  processes.  The  result  is  often  referred  to  as  a 
functional  central  limit  theorem  (FCLT)  or  invariance  principle  or  Donsker’s 
theorem. 

Giving a precise definition of the related concept of convergence in
distribution is simplified by considering convergence of probability measures.
A sequence of probability measures $\Pr_T$ is said to converge to the
probability measure $\Pr$ or $\Pr_T$ converges weakly to $\Pr$, if
$\Pr_T(A) \to \Pr(A)$ for all measurable sets $A$, with the exception of sets
for which the boundary points have nonzero probability mass. Instead of
considering the distribution functions, we may define convergence in
distribution via weak convergence of the corresponding sequence of probability
measures. Thus, constructing a probability space on a suitable space of
functions defined on the unit interval, the convergence in (C.8.1) can be
defined rigorously. Although this type of convergence is more properly called
weak convergence, we will still use the symbol $\xrightarrow{d}$ for
signifying it. For more precise discussions see, for example, Davidson (1994,
2000) or Johansen (1995).

We may also generalize the concept of convergence in probability to the
case of sequences of random functions. For a sequence $G_T(\cdot)$ and a
random function $G(\cdot)$ we write $G_T \xrightarrow{p} G$ if

\[
\sup_{t \in [0,1]} |G_T(t) - G(t)| \xrightarrow{p} 0.
\]

Another useful tool in dealing with unit root processes is the continuous
mapping theorem which states that, given a sequence of stochastic functions
$\{G_T(\cdot)\}$, a stochastic function $G(\cdot)$ and a continuous functional
$g(\cdot)$ (a function defined on a space of functions), we have

\[
G_T \xrightarrow{d} G \;\Rightarrow\; g(G_T) \xrightarrow{d} g(G).
\]

Using the FCLT, this theorem implies, for instance, that

\[
\int_0^1 \sqrt{T} X_T(r)\,dr \xrightarrow{d} \sigma_u \int_0^1 W(r)\,dr
\]

because the integral is a continuous functional.

These  tools  are  useful  in  proving  the  following  proposition  from  Hamilton 
(1994,  Proposition  17.1)  which  summarizes  a  number  of  helpful  results  from 
the  literature,  many  of  which  were  derived,  e.g.,  by  Phillips  (1987). 

Proposition  C.16  ( Properties  of  Random  Walks  and  Related  Quantities) 
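Suppose $x_t = x_{t-1} + u_t$ is a random walk with i.i.d. white noise,
$u_t \sim (0, \sigma_u^2)$, and $x_0 = 0$. Then the following results hold:

(1) $T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{d} \sigma_u W(1) \sim \mathcal{N}(0, \sigma_u^2)$.

(2) $T^{-1} \sum_{t=1}^{T} x_{t-1} u_t \xrightarrow{d} \tfrac{1}{2}\sigma_u^2 [W(1)^2 - 1] \sim \tfrac{1}{2}\sigma_u^2 [\chi^2(1) - 1]$.

(3) $T^{-3/2} \sum_{t=1}^{T} t u_t \xrightarrow{d} \sigma_u W(1) - \sigma_u \int_0^1 W(r)\,dr \sim \mathcal{N}(0, \sigma_u^2/3)$.

(4) $T^{-3/2} \sum_{t=1}^{T} x_{t-1} \xrightarrow{d} \sigma_u \int_0^1 W(r)\,dr \sim \mathcal{N}(0, \sigma_u^2/3)$.

(5) $T^{-2} \sum_{t=1}^{T} x_{t-1}^2 \xrightarrow{d} \sigma_u^2 \int_0^1 W(r)^2\,dr$.

(6) $T^{-5/2} \sum_{t=1}^{T} t x_{t-1} \xrightarrow{d} \sigma_u \int_0^1 r W(r)\,dr$.

(7) $T^{-3} \sum_{t=1}^{T} t x_{t-1}^2 \xrightarrow{d} \sigma_u^2 \int_0^1 r W(r)^2\,dr$.

(8) $T^{-(n+1)} \sum_{t=1}^{T} t^n \to 1/(n+1)$ for $n = 0, 1, \dots$.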
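
Some of these results can be checked by simulation. The sketch below generates
random walks with standard normal innovations. The sample size, the number of
replications and the normality of $u_t$ are assumptions of the example. It
looks at the scaled sums in results (1), (2) and (4) and at the deterministic
limit in result (8).

```python
import numpy as np

rng = np.random.default_rng(7)
T, reps, sigma_u = 500, 4_000, 1.0       # illustrative settings

u = sigma_u * rng.standard_normal((reps, T))
x = np.cumsum(u, axis=1)                                           # x_t with x_0 = 0
x_lag = np.concatenate([np.zeros((reps, 1)), x[:, :-1]], axis=1)   # x_{t-1}

stat1 = u.sum(axis=1) / np.sqrt(T)       # result (1): approximately N(0, sigma_u^2)
stat2 = (x_lag * u).sum(axis=1) / T      # result (2): limit has mean zero
stat4 = x_lag.sum(axis=1) / T ** 1.5     # result (4): approximately N(0, sigma_u^2 / 3)

# Result (8) is deterministic: T^{-(n+1)} sum_t t^n -> 1 / (n + 1).
t = np.arange(1, T + 1, dtype=float)
det_limits = {n: np.sum(t ** n) / T ** (n + 1) for n in (0, 1, 2)}

print(stat1.var(), stat2.mean(), stat4.var(), det_limits)
```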


From these results, the following asymptotic distributions of Dickey-Fuller
(DF) statistics for unit roots can be derived. For details see, e.g., Hamilton
(1994, Section 17.4). It is assumed that estimation is based on a sample
$y_1, \dots, y_T$ and a presample value $y_0$ is also available.


Proposition  C.17  ( Asymptotic  Distributions  of  Dickey-Fuller  Test  Statis¬ 
tics) 

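(1) Suppose
$\hat\rho = \sum_{t=1}^{T} y_{t-1} y_t / \sum_{t=1}^{T} y_{t-1}^2$ is the LS
estimator of the coefficient $\rho$ of the AR(1) process
$y_t = \rho y_{t-1} + u_t$, $t = 1, 2, \dots$, where
$u_t \sim (0, \sigma_u^2)$ is i.i.d. white noise. Here $y_0$ is a fixed
starting value or a stochastic variable with a given fixed distribution (which
does not depend on the sample size). Then, if $\rho = 1$,

\[
T(\hat\rho - 1) \xrightarrow{d}
\frac{\tfrac{1}{2}[W(1)^2 - 1]}{\int_0^1 W(r)^2\,dr}
\]

and the $t$-statistic

\[
t_{\hat\rho - 1} = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} \xrightarrow{d}
\frac{\tfrac{1}{2}[W(1)^2 - 1]}{\left[ \int_0^1 W(r)^2\,dr \right]^{1/2}},
\]

where
$\hat\sigma_{\hat\rho}^2 = T^{-1} \sum_{t=1}^{T} (y_t - \hat\rho y_{t-1})^2 / \sum_{t=1}^{T} y_{t-1}^2$
is the usual LS estimator of the variance of $\hat\rho$.

(2) Suppose $y_t = \mu + x_t$, $t = 1, 2, \dots$, with
$x_t = \rho x_{t-1} + u_t$, where $u_t \sim (0, \sigma_u^2)$ is i.i.d. white
noise and $\mu$ is a fixed mean term. Moreover, let $x_0 = 0$ and $y_0$ be a
fixed starting value or a stochastic variable with a given fixed distribution.
Furthermore, $\hat\rho$ is the LS estimator of $\rho$ from a regression
$y_t = \nu + \rho y_{t-1} + u_t$. Then, if $\rho = 1$,

\[
T(\hat\rho - 1) \xrightarrow{d}
\frac{\tfrac{1}{2}[W(1)^2 - 1] - W(1) \int_0^1 W(r)\,dr}
{\int_0^1 W(r)^2\,dr - \left[ \int_0^1 W(r)\,dr \right]^2}
\]

and the $t$-statistic

\[
t_{\hat\rho - 1} = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} \xrightarrow{d}
\frac{\tfrac{1}{2}[W(1)^2 - 1] - W(1) \int_0^1 W(r)\,dr}
{\left\{ \int_0^1 W(r)^2\,dr - \left[ \int_0^1 W(r)\,dr \right]^2 \right\}^{1/2}},
\]

where $\hat\sigma_{\hat\rho}^2$ is the usual LS estimator of the variance of
$\hat\rho$.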

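(3) Suppose $y_t = \nu + y_{t-1} + u_t$, $t = 1, 2, \dots$, where
$u_t \sim (0, \sigma_u^2)$ is i.i.d. white noise and $\nu \ne 0$ is a constant
term. Moreover, let $y_0$ be a fixed starting value or a stochastic variable
with a given fixed distribution. Furthermore, $\hat\rho$ is the LS estimator of
$\rho$ from a regression $y_t = \nu + \rho y_{t-1} + u_t$. Then, if
$\rho = 1$,

\[
T^{3/2}(\hat\rho - 1) \xrightarrow{d} \mathcal{N}(0, 12\sigma_u^2/\nu^2)
\]

and the $t$-statistic

\[
t_{\hat\rho - 1} = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} \xrightarrow{d} \mathcal{N}(0, 1),
\]

where $\hat\sigma_{\hat\rho}^2$ is the usual LS estimator of the variance of
$\hat\rho$.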

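(4) Suppose $y_t = \mu_0 + \mu_1 t + x_t$, $t = 1, 2, \dots$, with
$x_t = \rho x_{t-1} + u_t$, where $u_t \sim (0, \sigma_u^2)$ is i.i.d. white
noise and $\mu_0$ and $\mu_1$ are fixed intercept and trend slope terms.
Moreover, let $x_0 = 0$ and $y_0$ be a fixed starting value or a stochastic
variable with a given fixed distribution. Furthermore, $\hat\rho$ is the LS
estimator of $\rho$ from a regression
$y_t = \nu_0 + \nu_1 t + \rho y_{t-1} + u_t$. Then, if $\rho = 1$,

\[
T(\hat\rho - 1) \xrightarrow{d} a/b
\]

and

\[
t_{\hat\rho - 1} = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} \xrightarrow{d} a/\sqrt{b},
\]

where

\[
a = \int_0^1 W(r)\,dW(r)
+ 12 \left[ \int_0^1 r W(r)\,dr - \frac{1}{2} \int_0^1 W(r)\,dr \right]
\left[ \int_0^1 W(r)\,dr - \frac{1}{2} W(1) \right]
- W(1) \int_0^1 W(r)\,dr
\]

and

\[
b = \int_0^1 W(r)^2\,dr - 12 \left( \int_0^1 r W(r)\,dr \right)^2
+ 12 \int_0^1 W(r)\,dr \int_0^1 r W(r)\,dr
- 4 \left( \int_0^1 W(r)\,dr \right)^2.
\]

Furthermore, $\hat\sigma_{\hat\rho}^2$ is the usual LS estimator of the
variance of $\hat\rho$.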


Obviously, most of the asymptotic distributions obtained for $\hat\rho$ are
nonstandard if $\rho = 1$. In fact, even the convergence rate of the estimator
is nonstandard. It converges at a much faster rate to its true value of 1 than
usual estimators based on stationary processes. More precisely,
$\hat\rho - \rho = O_p(T^{-1})$ if $\rho = 1$ in Cases 1, 2, and 4 in the
proposition, whereas in the stationary case of an AR(1) process
$y_t = \rho y_{t-1} + u_t$, say, we have for the LS estimator of $\rho$,
$\hat\rho - \rho = O_p(T^{-1/2})$. The latter rate also holds if $y_t$ is
stationary and has a nonzero mean term. In Case 3 of Proposition C.17, the
convergence rate of $\hat\rho$ is even faster because in that case the
estimator is dominated by the linear trend which is generated by the drift
term.

It  is  important  to  note  that  the  limiting  distributions  in  Cases  1,  2,  and 
4  are  free  of  unknown  nuisance  parameters.  Therefore,  it  is  easy  to  compute 
percentage  points  of  the  limiting  distributions  by  simulation  methods.  To  do 
that,  it  is  strictly  speaking  not  even  necessary  to  know  the  exact  form  of 
the  asymptotic  distributions  of  the  estimators.  It  is  sufficient  to  know  that 
well-defined  asymptotic  distributions  are  obtained  which  do  not  depend  on 
unknown  nuisance  parameters.  Of  course,  there  are  also  situations  when  a 
more  detailed  knowledge  of  the  asymptotic  distributions  and  closed  form  ex¬ 
pressions  are  helpful. 
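
The simulation of percentage points mentioned above can be sketched as follows. This is a minimal illustration, assuming a regression without deterministic terms, $y_t = \rho y_{t-1} + u_t$, with Gaussian errors and $y_0 = 0$; the function name `simulate_df_statistics` and the choices of `T` and `n_rep` are hypothetical and not taken from the text.

```python
import numpy as np

def simulate_df_statistics(T=200, n_rep=5000, seed=0):
    """Simulate T*(rho_hat - 1) and the t-statistic under rho = 1.

    Assumption: no intercept or trend in the regression, u_t ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_rep, T))
    y = np.cumsum(u, axis=1)                      # random walks with y_0 = 0
    y_lag = np.concatenate([np.zeros((n_rep, 1)), y[:, :-1]], axis=1)
    sxx = np.sum(y_lag**2, axis=1)                # sum of y_{t-1}^2
    rho_hat = np.sum(y_lag * y, axis=1) / sxx     # LS estimator of rho
    resid = y - rho_hat[:, None] * y_lag
    s2 = np.sum(resid**2, axis=1) / (T - 1)       # residual variance estimate
    t_stat = (rho_hat - 1.0) / np.sqrt(s2 / sxx)
    return T * (rho_hat - 1.0), t_stat

coef_stat, t_stat = simulate_df_statistics()
# Empirical percentage points of the two simulated distributions
print(np.quantile(coef_stat, [0.01, 0.05, 0.10]))
print(np.quantile(t_stat, [0.01, 0.05, 0.10]))
```

Because the limiting distributions are free of nuisance parameters, the error variance can be fixed at one without loss of generality in such a simulation.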

The results of Proposition C.16 can be generalized in different ways. First of all, the process $x_t$ may have a more complicated dependence structure. In particular, the error process may be a stationary process. Consider, for instance, a process $x_t = x_{t-1} + w_t$, where $w_t = \sum_{j=0}^{\infty} \theta_j u_{t-j} = \theta(L)u_t$ is a stationary process with $\sum_{j=0}^{\infty} j|\theta_j| < \infty$ and $u_t \sim (0, \sigma_u^2)$ is white noise. Then $x_t$ can be rewritten as

$$x_t = x_0 + w_1 + \cdots + w_t = x_0 + \theta(1)(u_1 + \cdots + u_t) + \sum_{j=0}^{\infty} \theta_j^* u_{t-j} - w_0^*,$$

where $\theta(1) = \sum_{j=0}^{\infty} \theta_j$, $\theta_j^* = -\sum_{i=j+1}^{\infty} \theta_i$, $j = 0, 1, \dots$, and $w_0^* = \sum_{j=0}^{\infty} \theta_j^* u_{-j}$ contains initial values. Thus, $x_t$ is the sum of a random walk, a stationary process and initial values. Note that the condition $\sum_{j=0}^{\infty} j|\theta_j| < \infty$ ensures that $\sum_{j=0}^{\infty} |\theta_j^*| < \infty$, so that $\sum_{j=0}^{\infty} \theta_j^* u_{t-j}$ is indeed well-defined according to Proposition C.7. Although the condition for the $\theta_j$ is stronger than absolute summability, it is satisfied for many processes of practical interest. For example, the MA representation of a stable AR or ARMA process satisfies the condition. The decomposition of $x_t$ into a random walk, a stationary component, and initial values is known as the Beveridge-Nelson decomposition. It is a convenient tool in generalizing the results in Propositions C.16 and C.17.
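
The decomposition can be checked numerically for a finite-order MA error process. The sketch below assumes $w_t$ is an MA($q$) process and $x_0 = 0$, so that all infinite sums reduce to finite ones; the function name `bn_components` and the array layout are illustrative choices, not part of the text.

```python
import numpy as np

def bn_components(theta, u_arr):
    """Beveridge-Nelson components for x_t = x_{t-1} + w_t with MA(q) errors.

    theta : coefficients theta_0, ..., theta_q
    u_arr : innovations u_{1-q}, ..., u_T (length T + q); x_0 = 0 is assumed
    """
    theta = np.asarray(theta, dtype=float)
    q = theta.size - 1
    T = u_arr.size - q
    # theta*_j = -(theta_{j+1} + ... + theta_q), j = 0, ..., q-1
    theta_star = -(np.cumsum(theta[::-1])[::-1][1:])
    u = lambda t: u_arr[t + q - 1]                # maps time index t to array index
    w = np.array([sum(theta[j] * u(t - j) for j in range(q + 1))
                  for t in range(1, T + 1)])
    x = np.cumsum(w)                              # x_t = w_1 + ... + w_t
    trend = theta.sum() * np.cumsum([u(t) for t in range(1, T + 1)])
    stationary = np.array([sum(theta_star[j] * u(t - j) for j in range(q))
                           for t in range(1, T + 1)])
    w0_star = sum(theta_star[j] * u(-j) for j in range(q))
    return x, trend, stationary, w0_star

rng = np.random.default_rng(0)
x, trend, stationary, w0_star = bn_components([1.0, 0.6, -0.3], rng.standard_normal(52))
print(np.allclose(x, trend + stationary - w0_star))
```

The final comparison confirms that the random walk term, the stationary term, and the initial value term add up to $x_t$ for every $t$ in the simulated sample.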

In fact, if $y_t$ is a finite order AR process, $y_t = \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + u_t$, where $u_t$ is again white noise, $y_t$ can be rewritten as

$$y_t = \rho y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_{p-1} \Delta y_{t-p+1} + u_t$$

or, subtracting $y_{t-1}$ on both sides,

$$\Delta y_t = (\rho - 1) y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_{p-1} \Delta y_{t-p+1} + u_t.$$

Estimating $\rho$ or $\rho - 1$ from these equations by LS, it turns out that the resulting estimators have the same asymptotic properties as in Proposition C.17 (see, e.g., Hamilton (1994)).
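
The mapping between the two parameterizations is not spelled out in the text. Matching coefficients on $y_{t-1}, \dots, y_{t-p}$ gives $\rho = \alpha_1 + \cdots + \alpha_p$ and $\gamma_i = -(\alpha_{i+1} + \cdots + \alpha_p)$, $i = 1, \dots, p-1$. A small sketch of this conversion follows; the function names `ar_to_adf` and `adf_to_ar` are hypothetical.

```python
import numpy as np

def ar_to_adf(alpha):
    """Map AR coefficients alpha_1..alpha_p to (rho, gamma_1..gamma_{p-1})."""
    alpha = np.asarray(alpha, dtype=float)
    rho = alpha.sum()
    # gamma_i = -(alpha_{i+1} + ... + alpha_p)
    gamma = -np.cumsum(alpha[::-1])[::-1][1:]
    return rho, gamma

def adf_to_ar(rho, gamma):
    """Inverse map: recover alpha_1..alpha_p from rho and the gamma_i."""
    gamma = np.asarray(gamma, dtype=float)
    alpha = np.zeros(gamma.size + 1)
    alpha[0] = rho
    alpha[:-1] += gamma      # gamma_i multiplies  y_{t-i}
    alpha[1:] -= gamma       # gamma_i multiplies -y_{t-i-1}
    return alpha

rho, gamma = ar_to_adf([0.5, 0.3, 0.2])
print(rho, gamma)            # rho equals 1 here, so the process has a unit root
print(adf_to_ar(rho, gamma))
```

In this parameterization a unit root corresponds to $\rho = 1$, that is, to the AR coefficients summing to one.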

Another  possible  generalization  of  these  results  may  be  obtained  by  con¬ 
sidering  multivariate  processes.  We  will  tackle  both  generalizations  at  once  in 
the  following. 


C.8.2 Multivariate Processes


For the present purposes, multivariate Brownian motions or Wiener processes are of central importance. The univariate definition can be generalized as follows. A $K$-dimensional standard Brownian motion or standard Wiener process $W(\cdot)$ is a function defined on the unit interval $[0, 1]$ which assigns a $K$-dimensional random vector $W(t)$ to each $t \in [0, 1]$ such that:

(1)  W(0)  =  0  with  probability  one. 

(2)  A  realization  W (t)  is  a  continuous  function  in  t  on  the  unit  interval  with 
probability  one. 

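(3) For any partitioning of the unit interval, $0 < t_1 < t_2 < \cdots < t_k < 1$, the vector

$$\begin{bmatrix} W(t_2) - W(t_1) \\ \vdots \\ W(t_k) - W(t_{k-1}) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} (t_2 - t_1) I_K & & 0 \\ & \ddots & \\ 0 & & (t_k - t_{k-1}) I_K \end{bmatrix} \right),$$

that is, the differences have multivariate normal distributions with independent components, means of zero, and variances of the form $t_i - t_{i-1}$, depending on their difference in time.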




Again, for any nonsingular $(K \times K)$ matrix $P$, a nonstandard version of a Wiener process $Z(t) := PW(t)$ is obtained for which the increments are still independent but $Z(t) - Z(s) \sim \mathcal{N}(0, (t - s)PP')$ for $s < t$. Moreover, $Z(t) \sim \mathcal{N}(0, tPP')$.

For a sequence $G_T(\cdot)$ of multivariate random functions, we define convergence in probability to a random function $G$, $G_T \xrightarrow{p} G$, to hold if

$$\sup_{t \in [0,1]} \|G_T(t) - G(t)\| \xrightarrow{p} 0 \quad \text{as } T \to \infty.$$

Also,  the  continuous  mapping  theorem  remains  valid  in  the  multivariate  case. 
As in the univariate case, it is of interest to consider quantities of the form

$$X_T(r) = \frac{1}{T} \sum_{t=1}^{[Tr]} w_t,$$

where $w_t$ is a stationary stochastic process, $r \in [0, 1]$ denotes a fraction and $[Tr]$ signifies the largest integer less than or equal to $Tr$. If $w_t = u_t \sim (0, \Sigma_u)$ is i.i.d. white noise, it follows from a multivariate version of a suitable CLT (see Proposition C.13) that

$$\sqrt{T}[X_T(r_2) - X_T(r_1)] \xrightarrow{d} \mathcal{N}(0, (r_2 - r_1)\Sigma_u)$$

for $r_1 < r_2$. Hence, using the same ideas as in the univariate case,

$$\sqrt{T}\,\Sigma_u^{-1/2} X_T(\cdot) \xrightarrow{d} W(\cdot),$$

which is a multivariate version of the previously stated FCLT, also referred to as invariance principle or Donsker's theorem.

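If $x_t = x_{t-1} + w_t$, where

$$w_t = \Xi(L)u_t = \sum_{j=0}^{\infty} \Xi_j u_{t-j}, \quad \text{with } \sum_{j=0}^{\infty} j\|\Xi_j\| < \infty,$$

and $u_t \sim (0, \Sigma_u = (\sigma_{ij}))$ is white noise, then a multivariate Beveridge-Nelson decomposition is available,

$$x_t = x_0 + w_1 + \cdots + w_t = x_0 + \Xi(1)\sum_{s=1}^{t} u_s + \sum_{j=0}^{\infty} \Xi_j^* u_{t-j} - w_0^*,$$

where $\Xi(1) = \sum_{j=0}^{\infty} \Xi_j$, $\Xi_j^* = -\sum_{i=j+1}^{\infty} \Xi_i$, $j = 0, 1, \dots$, and $w_0^* = \sum_{j=0}^{\infty} \Xi_j^* u_{-j}$ contains initial values. Now $x_t$ is a sum of a multivariate random walk, a stationary process, and initial values (see also Proposition 6.1).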
Using  these  concepts,  the  following  generalized  version  of  Proposition  C.16 
can  be  established.  It  also  goes  back  to  Phillips  and  others  (see  Phillips  & 
Durlauf  (1986),  Park  &  Phillips  (1988,  1989),  Phillips  &  Solo  (1992),  Sims 
et  al.  (1990),  Johansen  (1995))  and  may  be  found,  e.g.,  in  Hamilton  (1994, 
Proposition  18.1). 
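Proposition C.18 (Properties of Multivariate Unit Root Processes)

Suppose $x_t = x_{t-1} + w_t$, $t = 1, 2, \dots$, is a $K$-dimensional generalized random walk with initial vector $x_0 = 0$ and stationary error term

$$w_t = \Xi(L)u_t = \sum_{j=0}^{\infty} \Xi_j u_{t-j}, \quad t \in \mathbb{Z},$$

where

$$\sum_{j=0}^{\infty} j\|\Xi_j\| < \infty,$$

and $u_t \sim (0, \Sigma_u = (\sigma_{ij}))$, $t \in \mathbb{Z}$, is i.i.d. white noise with finite fourth moments. Let $P$ be a lower triangular matrix such that $\Sigma_u = PP'$,

$$\Gamma_w(h) := E(w_t w_{t-h}') = \sum_{j=0}^{\infty} \Xi_{j+h} \Sigma_u \Xi_j', \quad h = 0, 1, 2, \dots,$$

for an arbitrary positive integer $n$, $W_t := (w_{t-1}', \dots, w_{t-n}')'$ is a $Kn$-dimensional vector with $\Gamma_W := E(W_t W_t')$, and the $(K \times K)$ matrix $\Lambda := \Xi(1)P$. Then the following results hold:

(1) $T^{-1/2} \sum_{t=1}^{T} w_t \xrightarrow{d} \Lambda W(1)$.

(2) $T^{-1/2} \sum_{t=1}^{T} W_t u_{it} \xrightarrow{d} \mathcal{N}(0, \sigma_{ii}\Gamma_W)$ for $i = 1, \dots, K$.

(3) $T^{-1} \sum_{t=1}^{T} w_t w_{t-h}' \xrightarrow{p} \Gamma_w(h)$ for $h = 0, 1, 2, \dots$.

(4) $T^{-1} \sum_{t=1}^{T} (x_{t-1} w_{t-h}' + w_{t-h} x_{t-1}') \xrightarrow{d} \Lambda W(1)W(1)'\Lambda' - \Gamma_w(0)$ for $h = 0$, and $\xrightarrow{d} \Lambda W(1)W(1)'\Lambda' + \sum_{j=-h+1}^{h-1} \Gamma_w(j)$ for $h = 1, 2, \dots$.

(5) $T^{-1} \sum_{t=1}^{T} x_{t-1} w_t' \xrightarrow{d} \Lambda \left\{\int_0^1 W(r)\,dW(r)'\right\} \Lambda' + \sum_{j=1}^{\infty} \Gamma_w(j)'$.

(6) $T^{-1} \sum_{t=1}^{T} x_{t-1} u_t' \xrightarrow{d} \Lambda \left\{\int_0^1 W(r)\,dW(r)'\right\} P'$.

(7) $T^{-3/2} \sum_{t=1}^{T} x_{t-1} \xrightarrow{d} \Lambda \int_0^1 W(r)\,dr$.

(8) $T^{-3/2} \sum_{t=1}^{T} t\,w_{t-h} \xrightarrow{d} \Lambda \left\{W(1) - \int_0^1 W(r)\,dr\right\}$ for $h = 0, 1, 2, \dots$.

(9) $T^{-2} \sum_{t=1}^{T} x_{t-1} x_{t-1}' \xrightarrow{d} \Lambda \left\{\int_0^1 W(r)W(r)'\,dr\right\} \Lambda'$.

(10) $T^{-5/2} \sum_{t=1}^{T} t\,x_{t-1} \xrightarrow{d} \Lambda \int_0^1 rW(r)\,dr$.

(11) $T^{-3} \sum_{t=1}^{T} t\,x_{t-1} x_{t-1}' \xrightarrow{d} \Lambda \left\{\int_0^1 rW(r)W(r)'\,dr\right\} \Lambda'$.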


These  results  are  the  basis  for  much  of  the  asymptotic  theory  related  to 
multivariate  VAR  processes  with  unit  roots.  Extensions  exist  for  more  general 
processes  wt  and  ut. 


D Evaluating Properties of Estimators and Test Statistics by Simulation and Resampling Techniques


If  asymptotic  theory  is  difficult  or  only  small  samples  are  available,  properties 
of  estimators  and  test  statistics  are  sometimes  investigated  by  heavy  use  of  the 
computer.  The  idea  is  to  simulate  the  distribution  (or  some  of  its  properties) 
of  the  random  variables  of  interest  by  artificially  sampling  from  some  known 
distribution. Generally, if the random variable or vector of interest, say $q = q(z)$, is a function of a random vector $z$ with a known distribution $F_z$, then samples $z_1, \dots, z_N$ are drawn from $F_z$ and the empirical distribution of $q$ given by $q_n = q(z_n)$, $n = 1, \dots, N$, is determined. The characteristics of the actual distribution of $q$ are then inferred from the empirical distribution.

Often  the  statistics  of  interest  in  this  book  are  functions  of  multiple  time 
series  generated  by  VAR(p)  processes.  Therefore,  we  will  briefly  describe  in  the 
next  section  how  to  simulate  such  time  series.  Afterwards,  some  more  details 
are  given  on  simulation  and  resampling  techniques  for  evaluating  estimators 
and  test  statistics. 


D.1 Simulating a Multiple Time Series with VAR Generation Process


To simulate a multiple time series of dimension $K$ and length $T$, we first generate a series of (often independent) disturbance vectors $u_{-s}, \dots, u_0, u_1, \dots, u_T$. If a series of Gaussian disturbances is desired, i.e., $u_t \sim \mathcal{N}(0, \Sigma_u)$, we may choose $K$ independent univariate standard normal variates $v_1, \dots, v_K$ and multiply by a $(K \times K)$ matrix $P$ for which $PP' = \Sigma_u$, that is,

$$u_t = P \begin{bmatrix} v_1 \\ \vdots \\ v_K \end{bmatrix}.$$

This process is repeated $T + s + 1$ times until we have the desired series of disturbances. Programs for generating (pseudo) standard normal variates are available on most computers. Also facilities for generating random numbers from other distributions are usually available and may be used in a similar manner to obtain disturbances from other distributions of interest.

For a given set of parameters $\nu, A_1, \dots, A_p$, where $\nu$ is $(K \times 1)$ and the $A_i$ are $(K \times K)$, and a given set of starting values $y_{-p+1}, \dots, y_0$, the $u_t$ may be used to simulate a time series $y_1, \dots, y_T$ with VAR($p$) generation process recursively as

$$y_t = \nu + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t$$

starting with $t = 1$, $t = 2$, etc. until $t = T$. There are different ways to obtain the initial values. Assuming that the desired process is stable, they may be set to zero or to the process mean $\mu = (I_K - A_1 - \cdots - A_p)^{-1}\nu$. Because the choice of initial values has some impact on the generated time series, a number of presample values $y_t$, $t = -s, \dots, 0$, is often generated and then discarded in the subsequent analysis.
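
The recursion with discarded presample values can be sketched as follows. This is an illustration under stated assumptions: Gaussian disturbances, a Choleski factor for $P$, a stable process, and start values set to the process mean; the function name `simulate_var` and the default `burn_in` of 100 are hypothetical choices.

```python
import numpy as np

def simulate_var(nu, A, Sigma_u, T, burn_in=100, seed=None):
    """Simulate T observations of a stable K-dimensional VAR(p) process.

    nu : (K,) intercept, A : list of p (K x K) matrices, Sigma_u : (K x K).
    burn_in presample values are generated and then discarded.
    """
    rng = np.random.default_rng(seed)
    nu = np.asarray(nu, dtype=float)
    A = [np.asarray(Ai, dtype=float) for Ai in A]
    K, p = nu.size, len(A)
    P = np.linalg.cholesky(Sigma_u)               # P P' = Sigma_u
    n = T + burn_in
    u = rng.standard_normal((n, K)) @ P.T         # rows are u_t' = (P v_t)'
    mu = np.linalg.solve(np.eye(K) - sum(A), nu)  # process mean
    y = np.tile(mu, (p + n, 1))                   # first p rows: start values
    for t in range(p, p + n):
        y[t] = nu + sum(A[i] @ y[t - 1 - i] for i in range(p)) + u[t - p]
    return y[-T:]                                 # drop start and burn-in values

y = simulate_var(nu=[0.02, 0.03],
                 A=[[[0.5, 0.1], [0.4, 0.5]]],
                 Sigma_u=np.array([[0.09, 0.0], [0.0, 0.04]]),
                 T=200, seed=0)
print(y.shape)
```

Any other matrix $P$ with $PP' = \Sigma_u$ could replace the Choleski factor without changing the distribution of the disturbances.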

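A possible way to ensure the same correlation structure for the initial values and the rest of the time series is to determine the covariance matrix of $p$ consecutive $y_t$ vectors, say $\Sigma_Y$. Using the results of Chapter 2, Section 2.1, that matrix may be obtained from

$$\mathrm{vec}(\Sigma_Y) = (I_{(Kp)^2} - \mathbf{A} \otimes \mathbf{A})^{-1} \mathrm{vec}(\Sigma_U),$$

where

$$\mathbf{A} = \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I_K & 0 & \cdots & 0 & 0 \\ 0 & I_K & & 0 & 0 \\ \vdots & & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I_K & 0 \end{bmatrix} \quad \text{and} \quad \Sigma_U = \begin{bmatrix} \Sigma_u & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \quad (Kp \times Kp).$$

Then a $(Kp \times Kp)$ matrix $Q$ is chosen such that $QQ' = \Sigma_Y$ and $p$ initial starting vectors are obtained as

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{-p+1} \end{bmatrix} = Q \begin{bmatrix} v_1 \\ \vdots \\ v_{Kp} \end{bmatrix} + \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix},$$

where the $v_i$ are independent variates with mean zero and unit variance and $\mu = (I_K - A_1 - \cdots - A_p)^{-1}\nu$ is the process mean.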
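
A sketch of this construction is given below. It assumes a stable process, column-stacking for the vec operator, a Choleski factor for $Q$, and standard normal variates; the helper names `companion`, `stationary_covariance`, and `draw_initial_values` are hypothetical.

```python
import numpy as np

def companion(A):
    """(Kp x Kp) companion matrix built from the list A = [A_1, ..., A_p]."""
    K, p = A[0].shape[0], len(A)
    top = np.hstack(A)
    if p == 1:
        return top
    bottom = np.hstack([np.eye(K * (p - 1)), np.zeros((K * (p - 1), K))])
    return np.vstack([top, bottom])

def stationary_covariance(A, Sigma_u):
    """Solve vec(Sigma_Y) = (I - A (x) A)^{-1} vec(Sigma_U) for Sigma_Y."""
    K, p = A[0].shape[0], len(A)
    n = K * p
    A_c = companion(A)
    Sigma_U = np.zeros((n, n))
    Sigma_U[:K, :K] = Sigma_u
    vec_sigma = np.linalg.solve(np.eye(n * n) - np.kron(A_c, A_c),
                                Sigma_U.reshape(-1, order="F"))
    Sigma_Y = vec_sigma.reshape((n, n), order="F")
    return (Sigma_Y + Sigma_Y.T) / 2              # remove rounding asymmetry

def draw_initial_values(nu, A, Sigma_u, rng):
    """Draw (y_0, y_{-1}, ..., y_{-p+1}); returns an array of shape (p, K)."""
    K, p = A[0].shape[0], len(A)
    Q = np.linalg.cholesky(stationary_covariance(A, Sigma_u))  # Q Q' = Sigma_Y
    mu = np.linalg.solve(np.eye(K) - sum(A), nu)
    stacked = Q @ rng.standard_normal(K * p) + np.tile(mu, p)
    return stacked.reshape(p, K)                  # row 0 is y_0

A = [np.array([[0.5, 0.1], [0.4, 0.5]]), np.array([[0.0, 0.0], [0.25, 0.0]])]
Sigma_u = np.array([[0.09, 0.0], [0.0, 0.04]])
print(draw_initial_values(np.array([0.02, 0.03]), A, Sigma_u,
                          np.random.default_rng(0)))
```

Solving the linear system directly requires a $(Kp)^2 \times (Kp)^2$ matrix, which is only practical for moderate $K$ and $p$.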


D.2  Evaluating  Distributions  of  Functions  of  Multiple 
Time  Series  by  Simulation 

Suppose we are interested in the function $q_T = q(y_1, \dots, y_T)$ of some VAR($p$) process $y_t$, where $q_T$ is of dimension $(M \times 1)$. The quantity $q_T$ may be some estimator or test statistic. To investigate the distribution $F_T$ of $q_T$, we generate a large number, say $N$, of independent multiple time series of length $T$ and compute the corresponding values of $q_T$, say $q_T(n)$, $n = 1, \dots, N$. The properties of $F_T$ are then estimated from the empirical distribution of the $q_T(n)$. For instance, the mean vector of $q_T$ is estimated as

$$\bar{q}_T = \frac{1}{N} \sum_{n=1}^{N} q_T(n).$$

Analogously, we may estimate the variances, standard deviations, quantiles or other characteristics of $F_T$.
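
As a minimal sketch of such a Monte Carlo experiment, the code below estimates the mean, standard deviation, and two quantiles of the LS estimator in a univariate AR(1) model with $T = 50$. The univariate setup, the names `monte_carlo`, `simulate_ar1`, and `ls_ar1`, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def monte_carlo(statistic, simulate, N, seed=0):
    """Empirical characteristics of statistic(simulate(rng)) over N replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([np.atleast_1d(statistic(simulate(rng))) for _ in range(N)])
    return {
        "mean": draws.mean(axis=0),
        "std": draws.std(axis=0, ddof=1),
        "q05": np.quantile(draws, 0.05, axis=0),
        "q95": np.quantile(draws, 0.95, axis=0),
    }

def simulate_ar1(rng, T=50, rho=0.5):
    """Stationary Gaussian AR(1) series y_0, ..., y_T without intercept."""
    u = rng.standard_normal(T + 1)
    y = np.empty(T + 1)
    y[0] = u[0] / np.sqrt(1.0 - rho**2)           # draw y_0 from the stationary law
    for t in range(1, T + 1):
        y[t] = rho * y[t - 1] + u[t]
    return y

def ls_ar1(y):
    """LS estimator of rho from a regression of y_t on y_{t-1}."""
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

result = monte_carlo(ls_ar1, simulate_ar1, N=2000)
print(result)
```

Comparing the estimated mean with the true value of 0.5 gives an impression of the small sample bias of the estimator.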


D.3  Resampling  Methods 

If the distribution of the disturbances of a VAR model under consideration is unknown, so-called bootstrap or resampling methods may be applied to investigate the distributions of functions of stochastic processes or multiple time series. Suppose a time series $y_1, \dots, y_T$ and the presample values required for estimation are available. Fitting a VAR($p$) model to this time series, we get coefficient estimates $\hat{A}_1, \dots, \hat{A}_p$, and a series of residuals $\hat{u}_1, \dots, \hat{u}_T$. An estimator of a quantity of interest, say $q = q(A_1, \dots, A_p)$, is then obtained as

$$\hat{q} = q(\hat{A}_1, \dots, \hat{A}_p). \qquad (D.3.1)$$

The properties of $\hat{q}$ follow from those of $\hat{A}_1, \dots, \hat{A}_p$. To assess the sampling uncertainty of $\hat{q}$, confidence intervals are often established, based on the asymptotic distribution of $\hat{q}$. Alternatively, if $\hat{q}$ is a test statistic, its $p$-value may be of interest which can be approximated on the basis of the asymptotic distribution. Unfortunately, this distribution is often a rather poor approximation of the actual distribution for a given finite sample. In some of these cases, bootstrap methods provide a better small sample approximation. The theoretical justification for the bootstrap also rests on asymptotic theory, however. In particular, it can usually be justified if the quantity of interest has a normal limiting distribution (Horowitz (2001)).

A residual based bootstrap is often used in this context. Assuming that a sample $y_1, \dots, y_T$ plus presample values as required are available, it proceeds as follows:

(1) The parameters of the model under consideration are estimated. Let $\hat{u}_t$, $t = 1, \dots, T$, be the estimation residuals.

(2) Centered residuals $\hat{u}_1 - \bar{u}_\cdot, \dots, \hat{u}_T - \bar{u}_\cdot$ are computed. Here $\bar{u}_\cdot = T^{-1} \sum_{t=1}^{T} \hat{u}_t$ denotes the usual average. Bootstrap residuals $u_1^*, \dots, u_T^*$ are then obtained by randomly drawing with replacement from the centered residuals.

(3) Bootstrap time series are computed recursively as

$$y_t^* = \hat\nu + \hat{A}_1 y_{t-1}^* + \cdots + \hat{A}_p y_{t-p}^* + u_t^*, \quad t = 1, \dots, T,$$

where the same initial values may be used for each generated series, $(y_{-p+1}^*, \dots, y_0^*) = (y_{-p+1}, \dots, y_0)$.

(4) Based on the bootstrap time series, the parameters $A_1, \dots, A_p$ are reestimated.

(5) Using the parameter estimates obtained in the previous stage, a bootstrap version of the statistic of interest, say $\hat{q}^*$, is calculated.

(6) These steps are repeated $N$ times, where $N$ is a large number.
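
The six steps can be sketched in code as follows. This is an illustrative implementation, assuming a VAR($p$) with intercept estimated by multivariate LS; the function names `fit_var` and `residual_bootstrap`, the data layout (first $p$ rows are presample values), and the example statistic are hypothetical.

```python
import numpy as np

def fit_var(y, p):
    """LS estimation of a VAR(p) with intercept.

    y has shape (T + p, K); the first p rows are presample values.
    Returns B with rows nu', A_1', ..., A_p' and the (T x K) residuals.
    """
    T = y.shape[0] - p
    Z = np.hstack([np.ones((T, 1))] + [y[p - i:p - i + T] for i in range(1, p + 1)])
    Y = y[p:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return B, Y - Z @ B

def residual_bootstrap(y, p, statistic, N=500, seed=0):
    """Return q_hat and N bootstrap versions q_star of statistic(B)."""
    rng = np.random.default_rng(seed)
    T = y.shape[0] - p
    B_hat, resid = fit_var(y, p)                         # step (1)
    centered = resid - resid.mean(axis=0)                # step (2)
    q_star = []
    for _ in range(N):                                   # step (6)
        u_star = centered[rng.integers(0, T, size=T)]    # draw with replacement
        y_star = np.empty_like(y, dtype=float)
        y_star[:p] = y[:p]                               # same initial values
        for t in range(p, p + T):                        # step (3)
            z = np.concatenate([[1.0]] + [y_star[t - i] for i in range(1, p + 1)])
            y_star[t] = z @ B_hat + u_star[t - p]
        B_star, _ = fit_var(y_star, p)                   # step (4)
        q_star.append(statistic(B_star))                 # step (5)
    return statistic(B_hat), np.array(q_star)

# Example: bootstrap the upper left element of A_1 in a bivariate VAR(1)
rng = np.random.default_rng(1)
A1 = np.array([[0.5, 0.1], [0.4, 0.5]])
y = np.zeros((201, 2))
for t in range(1, 201):
    y[t] = A1 @ y[t - 1] + rng.standard_normal(2)
q_hat, q_star = residual_bootstrap(y, p=1, statistic=lambda B: B[1, 0], N=200)
print(q_hat, q_star.mean(), q_star.std(ddof=1))
```

Because an intercept is included in the regression, the LS residuals already have mean zero and the centering step has no effect here; it matters for models estimated without an intercept or with restrictions.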

There  is  now  a  range  of  other  bootstrap  methods  which  may  have  advan¬ 
tages  in  certain  situations.  For  example,  rather  than  using  a  residual-based 
bootstrap,  a  block  bootstrap  may  be  applied  which  is  based  on  the  original 
observations  rather  than  the  model  residuals  (see,  e.g.,  Li  &  Maddala  (1996) 
for  details).  It  may  be  preferable  if  there  is  uncertainty  regarding  specific  as¬ 
pects  of  the  model  like,  for  instance,  the  VAR  order.  These  methods  are  not 
discussed  here  because  residual  based  bootstraps  are  still  the  most  popular 
methods  in  the  present  context. 

In the following, the symbol $q$ denotes the quantity of interest for which a confidence interval is desired. Its estimator implied by the estimators of the model coefficients and the corresponding bootstrap estimator are denoted by $\hat{q}$ and $\hat{q}^*$, respectively. The following bootstrap confidence intervals are examples that have been considered in the literature in the context of impulse response analysis (see, e.g., Benkwitz, Lütkepohl & Wolters (2001)):

• Standard percentile interval

Denoting by $s^*_{\gamma/2}$ and $s^*_{(1-\gamma/2)}$ the $\gamma/2$- and $(1 - \gamma/2)$-quantiles, respectively, of the $N$ bootstrap versions of $\hat{q}^*$, the interval

$$CI_S = \left[s^*_{\gamma/2},\ s^*_{(1-\gamma/2)}\right]$$

may be set up. It is the percentile confidence interval discussed, e.g., by Efron & Tibshirani (1993).

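• Hall's percentile interval

Hall (1992) uses the result that asymptotically the distribution of $\sqrt{T}(\hat{q} - q)$ corresponds to that of $\sqrt{T}(\hat{q}^* - \hat{q})$, to derive the interval

$$CI_H = \left[\hat{q} - t^*_{(1-\gamma/2)},\ \hat{q} - t^*_{\gamma/2}\right].$$

Here $t^*_{\gamma/2}$ and $t^*_{(1-\gamma/2)}$ are the $\gamma/2$- and $(1 - \gamma/2)$-quantiles, respectively, of $(\hat{q}^* - \hat{q})$ and the interval is obtained by pretending that these are the quantiles of $(\hat{q} - q)$.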

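• Hall's studentized interval

A studentized statistic $(\hat{q} - q)/(\widehat{\mathrm{Var}}(\hat{q}))^{1/2}$ often results in more precise confidence intervals at least in theory. Using bootstrap quantiles $t^{**}_{\gamma/2}$ and $t^{**}_{(1-\gamma/2)}$ from the distribution of $(\hat{q}^* - \hat{q})/(\widehat{\mathrm{Var}}(\hat{q}^*))^{1/2}$, an interval

$$CI_{SH} = \left[\hat{q} - t^{**}_{(1-\gamma/2)}(\widehat{\mathrm{Var}}(\hat{q}))^{1/2},\ \hat{q} - t^{**}_{\gamma/2}(\widehat{\mathrm{Var}}(\hat{q}))^{1/2}\right]$$

can be constructed by using these quantities in conjunction with $(\hat{q} - q)/(\widehat{\mathrm{Var}}(\hat{q}))^{1/2}$. Here the variance $\mathrm{Var}(\hat{q})$ may be estimated from the bootstrap estimates of $q$,

$$\widehat{\mathrm{Var}}(\hat{q}) = \frac{1}{N - 1} \sum_{i=1}^{N} \left(\hat{q}^{*,i} - \bar{\hat{q}}^{*}\right)^2,$$

where $\bar{\hat{q}}^{*} = N^{-1} \sum_{i=1}^{N} \hat{q}^{*,i}$, $N$ is the number of bootstrap replications and $\hat{q}^{*,i}$ denotes the value of the statistic of interest obtained in the $i$-th bootstrap replication. Moreover, the variances $\mathrm{Var}(\hat{q}^*)$ may be estimated by a bootstrap within each bootstrap replication. In other words,

$$\widehat{\mathrm{Var}}(\hat{q}^*) = \frac{1}{N^* - 1} \sum_{i=1}^{N^*} \left(\hat{q}^{**,i} - \bar{\hat{q}}^{**}\right)^2,$$

where $\hat{q}^{**,i}$ is obtained by a double bootstrap, that is, pseudo-data are generated according to a process obtained on the basis of the bootstrap systems parameters, $\bar{\hat{q}}^{**}$ is the average of the $\hat{q}^{**,i}$, and $N^*$ is the number of bootstrap replications within each bootstrap replication.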

A  number  of  refinements  and  modifications  of  these  intervals  exist  (see 
Hall  (1992)). 
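
Given a vector of bootstrap replications, the three intervals can be computed as sketched below. The function names are hypothetical, quantiles are taken with NumPy's default linear interpolation, and for the studentized interval the inner (double bootstrap) variance estimates are assumed to be supplied by the caller as `var_star`.

```python
import numpy as np

def standard_percentile_interval(q_star, gamma=0.05):
    """CI_S: quantiles of the bootstrap versions themselves."""
    lo, hi = np.quantile(q_star, [gamma / 2, 1 - gamma / 2])
    return lo, hi

def hall_percentile_interval(q_hat, q_star, gamma=0.05):
    """CI_H: quantiles of (q_star - q_hat), reflected around q_hat."""
    t_lo, t_hi = np.quantile(q_star - q_hat, [gamma / 2, 1 - gamma / 2])
    return q_hat - t_hi, q_hat - t_lo

def hall_studentized_interval(q_hat, q_star, var_star, gamma=0.05):
    """CI_SH: var_star[i] is the variance estimate for the i-th bootstrap
    replication, e.g. from a bootstrap within that replication."""
    se_hat = np.std(q_star, ddof=1)               # estimates Var(q_hat)^(1/2)
    t = (q_star - q_hat) / np.sqrt(var_star)
    t_lo, t_hi = np.quantile(t, [gamma / 2, 1 - gamma / 2])
    return q_hat - t_hi * se_hat, q_hat - t_lo * se_hat

# Artificial bootstrap replications that are not centered at q_hat
q_hat = 1.0
q_star = 0.7 + np.arange(101) / 100
print(standard_percentile_interval(q_star, gamma=0.10))
print(hall_percentile_interval(q_hat, q_star, gamma=0.10))
```

In this artificial example the bootstrap values are shifted upward relative to $\hat{q}$, so the standard percentile interval is shifted upward as well, whereas Hall's percentile interval is shifted in the opposite direction.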

The bootstrap confidence intervals have the property that they attain the nominal confidence content at least asymptotically under general conditions. Roughly speaking, if $\sqrt{T}(\hat{q} - q)$ converges as $T \to \infty$, $\sqrt{T}(\hat{q}^* - \hat{q})$ converges to the same limit distribution under suitable conditions (e.g., Hall (1992)). Therefore $CI_H$ has the correct size asymptotically, that is, $\Pr(q \in CI_H) \to 1 - \gamma$ as $T \to \infty$, under general conditions, and, hence, Hall's percentile method is asymptotically precise. The same holds for the $CI_{SH}$ interval. On the other hand, to obtain such a result for the standard percentile interval $CI_S$, the limiting distribution of $\sqrt{T}(\hat{q} - q)$ has to be symmetric about zero. For example, this result holds if it is zero mean normal. Roughly speaking, $CI_S$ works with an implicit asymptotic unbiasedness assumption for $\hat{q}$. If the distribution of $\hat{q}$ is not centered at $q$, $CI_S$ will generally not have the desired confidence content even asymptotically (see also Efron & Tibshirani (1993) and Benkwitz et al. (2000) for a more detailed discussion of this point).

If $\hat{q}$ is a statistic for which a $p$-value is desired, the following method may be used. Recall that the $p$-value of a test is the probability of obtaining a value of the test statistic greater than the observed one, if the null hypothesis holds. Hence, the $p$-value may be estimated by the proportion of bootstrap values $\hat{q}^*$ exceeding the value of the test statistic $\hat{q}$. Again, under general assumptions, this estimator is consistent.
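
The proportion described above is straightforward to compute. The sketch below assumes that the bootstrap values have been generated in a way that respects the null hypothesis and that large values of the statistic speak against it; the function name `bootstrap_p_value` is hypothetical.

```python
import numpy as np

def bootstrap_p_value(q_hat, q_star):
    """Proportion of bootstrap statistics exceeding the observed statistic."""
    q_star = np.asarray(q_star, dtype=float)
    return float(np.mean(q_star > q_hat))

# Ten artificial bootstrap values 0, 1, ..., 9 and an observed value of 6.5
print(bootstrap_p_value(6.5, np.arange(10)))   # three values exceed 6.5
```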
Index of Notation


Most  of  the  notation  is  clearly  defined  in  the  text  where  it  is  used.  The  follow¬ 
ing  list  is  meant  to  provide  some  general  guidelines.  Occasionally,  in  the  text 
a  symbol  has  a  meaning  which  differs  from  the  one  specified  in  this  list  when 
confusion  is  unlikely.  For  instance,  A  usually  stands  for  a  VAR  coefficient 
matrix  whereas  in  the  Appendix  it  is  often  a  general  matrix. 
General Symbols

$=$  equals
$:=$  equals by definition
$\Rightarrow$  implies
$\Leftrightarrow$  is equivalent to
$\sim$  is distributed as
$\in$  element of
$\subset$  subset of
$\cup$  union
$\cap$  intersection
$\sum$  summation sign
$\prod$  product sign
$\to$  converges to, approaches
$\xrightarrow{p}$  converges in probability to
$\xrightarrow{a.s.}$  converges almost surely to
$\xrightarrow{q.m.}$  converges in quadratic mean to
$\xrightarrow{d}$  converges in distribution to
i.i.d.  independently, identically distributed
lim  limit
plim  probability limit
max  maximum
min  minimum
sup  supremum, least upper bound
ln  natural logarithm
exp  exponential function
$|z|$  absolute value or modulus of $z$
$K$  dimension of a stochastic process or time series
$T$  sample size, time series length
$\mathbb{R}$  real numbers
$\mathbb{R}^m$  $m$-dimensional Euclidean space
$\mathbb{C}$  complex numbers
$\mathbb{Z}$  integers
$\mathbb{N}$  positive integers
$I(\cdot)$  indicator function
$L$  lag operator
$\Delta$  differencing operator
$E$  expectation
Var  variance
Cov  covariance, covariance matrix
MSE  mean squared error (matrix)
Pr  probability
$l(\cdot)$  likelihood function
$\ln l$  log-likelihood function
$l_0(\cdot)$  approximate likelihood function
$\ln l_0$  approximate log-likelihood function
$\lambda_{LM}$  Lagrange multiplier statistic
$\lambda_{LR}$  likelihood ratio statistic
$\lambda_W$  Wald statistic
$Q_h$  portmanteau statistic
$\bar{Q}_h$  modified portmanteau statistic
d.f.  degrees of freedom
AIC  Akaike information criterion
FPE  final prediction error (criterion)
HQ  Hannan-Quinn (criterion)
SC  Schwarz criterion

Distributions and Stochastic Processes

$\mathcal{N}(\mu, \Sigma)$  (multivariate) normal distribution with mean (vector) $\mu$ and variance (covariance matrix) $\Sigma$
$\chi^2(m)$  $\chi^2$-distribution with $m$ degrees of freedom
$F(m, n)$  $F$-distribution with $m$ numerator and $n$ denominator degrees of freedom
$t(m)$  $t$-distribution with $m$ degrees of freedom
AR  autoregressive (process)
AR($p$)  autoregressive process of order $p$
ARCH  autoregressive conditional heteroskedasticity
ARMA  autoregressive moving average (process)
ARMA($p, q$)  autoregressive moving average process of order $(p, q)$
ARMA$_E$  echelon form VARMA model
ARMA$_E(p_1, \dots, p_K)$  echelon form VARMA model with Kronecker indices $(p_1, \dots, p_K)$
EC-ARMA$_{RE}$  error correction echelon form VARMA model
GARCH  generalized autoregressive conditional heteroskedasticity
MA  moving average (process)
MA($q$)  moving average process of order $q$
MGARCH  multivariate generalized autoregressive conditional heteroskedasticity
PAR  periodic (vector) autoregression
VAR  vector autoregressive (process)
VAR($p$)  vector autoregressive process of order $p$
VARMA  vector autoregressive moving average (process)
VARMA($p, q$)  vector autoregressive moving average process of order $(p, q)$
VECM  vector error correction model
Vector and Matrix Operations

$M'$  transpose of $M$
$M^{adj}$  adjoint of $M$
$M^{-1}$  inverse of $M$
$M^{+}$  Moore-Penrose generalized inverse of $M$
$M_{\perp}$  orthogonal complement of $M$
$M^{1/2}$  square root of $M$
$M^k$  $k$-th power of $M$
$MN$  matrix product of $M$ and $N$
$+$  plus
$-$  minus
$\otimes$  Kronecker product
$\det(M)$, $\det M$  determinant of $M$
$|M|$  determinant of $M$
$\|M\|$  Euclidean norm of $M$
$\mathrm{rk}(M)$, $\mathrm{rk}\,M$  rank of $M$
$\mathrm{tr}(M)$, $\mathrm{tr}\,M$  trace of $M$
vec  column stacking operator
vech  column stacking operator for symmetric matrices (stacks the elements on and below the main diagonal only)
$\partial\varphi / \partial\beta'$  vector or matrix of first order partial derivatives of $\varphi$ with respect to $\beta$
$\partial^2\varphi / \partial\beta\,\partial\beta'$  Hessian matrix of $\varphi$, matrix of second order partial derivatives of $\varphi$ with respect to $\beta$

General Matrices

$D_m$  $(m^2 \times \frac{1}{2}m(m + 1))$ duplication matrix
$I_m$  $(m \times m)$ unit or identity matrix
$\mathcal{I}(\cdot)$  information matrix
$\mathcal{I}_a(\cdot)$  asymptotic information matrix
$J$  $:= [I_K : 0 : \cdots : 0]$
$K_{mn}$  $(mn \times mn)$ commutation matrix
$L_m$  $(\frac{1}{2}m(m + 1) \times m^2)$ elimination matrix
$0$  zero or null matrix or vector
Vectors and Matrices Related to Stochastic Processes and Multiple Time Series

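$u_t$  $K$-dimensional white noise process
$u_{kt}$  $k$-th element of $u_t$
$u_{(k)}$  $:= [u_{k1}, \dots, u_{kT}]'$
$U$  $:= [u_1, \dots, u_T]$
$\mathbf{u}$  $:= \mathrm{vec}(U)$
$y_t$  $K$-dimensional stochastic process
$y_{kt}$  $k$-th element of $y_t$
$y_{(k)}$  $:= [y_{k1}, \dots, y_{kT}]'$
$\bar{y}$  $:= \sum_{t=1}^{T} y_t / T$, sample mean (vector)
$y_t(h)$  $h$-step forecast of $y_{t+h}$ at origin $t$
$Y$  $:= [y_1, \dots, y_T]$
$\mathbf{y}$  $:= \mathrm{vec}(Y)$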


Yt  := 


yt-p+i  or  yt.-p+i 

ut  xt 


yt.-p+i 


Ut-q  +  l 


Xf—s+l 


Zt  := 


Vt—p+i 
Matrices and Vectors Related to VAR and VARMA Representations and
VECMs  (Parts  I,  II,  III,  IV) 

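$A_i$  VAR coefficient matrix
$A$  $:= [A_1, \dots, A_p]$
$\alpha$  $:= \mathrm{vec}(A)$

$$\mathbf{A} := \begin{bmatrix} A_1 & \cdots & A_{p-1} & A_p \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I_K & 0 \end{bmatrix}$$

$$\mathbf{A}_{11} := \begin{bmatrix} A_1 & \cdots & A_{p-1} & A_p \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I_K & 0 \end{bmatrix} \quad (Kp \times Kp)$$

$$\mathbf{A}_{12} := \begin{bmatrix} M_1 & \cdots & M_{q-1} & M_q \\ 0 & \cdots & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & 0 \end{bmatrix} \quad (Kp \times Kq)$$

$$\mathbf{A}_{21} := 0 \quad (Kq \times Kp)$$

$$\mathbf{A}_{22} := \begin{bmatrix} 0 & 0 \\ I_{K(q-1)} & 0 \end{bmatrix} \quad (Kq \times Kq)$$

$M_i$  MA coefficient matrix
$m$  $:= \mathrm{vec}[M_1, \dots, M_q]$

$$\mathbf{M} := \begin{bmatrix} \mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{bmatrix} \quad (K(p + q) \times K(p + q))$$

$$\mathbf{M}_{11} := \begin{bmatrix} -M_1 & \cdots & -M_{q-1} & -M_q \\ I_K & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I_K & 0 \end{bmatrix} \quad (Kq \times Kq)$$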
A_1  ...  A_{p-1}  A_p
0  ...  0  0 

M12  :=  .  .  .  ( Kq  x  Kp) 

0  ...  0  0 

M2i  :=  0  (Kp  x  Kg) 

M22  :=  °  °  (Kp  x  Kp) 

$\Phi_i$  coefficient matrix of canonical MA representation
$\Pi_i$  coefficient matrix of pure VAR representation
$\alpha$  loading matrix of VECM
$\beta$  cointegration matrix
$\Pi$  $:= \alpha\beta'$
$\Gamma_j$  short-run coefficient matrix of VECM

Impulse Responses and Related Quantities

$\Phi_i$  matrix of forecast error impulse responses
$\Psi_m$  $:= \sum_{i=0}^{m} \Phi_i$, matrix of accumulated forecast error impulse responses
$\Psi_\infty$  $:= \sum_{i=0}^{\infty} \Phi_i$, matrix of total or long-run forecast error impulse responses
$\Theta_i$  matrix of orthogonalized impulse responses
$\Xi_m$  $:= \sum_{i=0}^{m} \Theta_i$, matrix of accumulated orthogonalized impulse responses
$\Xi_\infty$  $:= \sum_{i=0}^{\infty} \Theta_i$, matrix of total or long-run orthogonalized impulse responses
$\omega_{jk,h}$  proportion of $h$-step forecast error variance of variable $j$, accounted for by innovations in variable $k$
$\Xi$  matrix of long-run effects
$\Xi^*$  matrix of transitory effects
Moment Matrices

$\Gamma$  $:= \mathrm{plim}\, ZZ'/T$
$\Gamma_y(h)$  $:= \mathrm{Cov}(y_t, y_{t-h})$ for a stationary process $y_t$
$R_y(h)$  correlation matrix corresponding to $\Gamma_y(h)$
$\Sigma_u$  $:= E(u_t u_t') = \mathrm{Cov}(u_t)$, white noise covariance matrix
$\Sigma_y$  $:= E[(y_t - \mu)(y_t - \mu)'] = \mathrm{Cov}(y_t)$, covariance matrix of a stationary process $y_t$
$P$  lower triangular Choleski decomposition of $\Sigma_u$
$\Sigma_{\hat\alpha}$  covariance matrix of the asymptotic distribution of $\sqrt{T}(\hat\alpha - \alpha)$
$\Omega(h)$  correction term for MSE matrix of $h$-step forecast
$\Sigma_y(h)$  MSE or forecast error covariance matrix of $h$-step forecast of $y_t$
$\Sigma_{\hat{y}}(h)$  approximate MSE matrix of $h$-step forecast of estimated process $y_t$
Autoregressive  representation  of
VARMA  process 
infinite  order  — ,  425 
pure  — ,  425 

Bayesian  estimation 

—  of  Gaussian  VAR  process,  222-229 

—  of  integrated  systems,  309-315 

—  with  normal  priors,  222-225, 
309-312 

basics  of  — ,  222 


BEKK  model,  565-567 
Beveridge-Nelson  decomposition,  242, 
703 

multivariate  — ,  252,  704 
Bilinear  state  space  model,  624 
Bilinear  time  series  model,  624 
Block  bootstrap,  710 
Bootstrap,  709-712 

—  Hall’s  percentile  interval,  710 
—  Hall’s  studentized  interval,  710 

—  confidence  interval,  710 

—  standard  percentile  interval,  710 
block  — ,  710 

residual  based  — ,  709 
Bootstrap  confidence  interval 
Hall’s  percentile  — ,  710 
Hall’s  studentized  — ,  710 
standard  percentile  — ,  710 
Bottom-up  specification  of  subset  VAR 
model,  210-211 
Bounded  in  probability,  684 
Box-Jenkins  methodology,  493,  495 
Breusch-Godfrey  test  for  autocorrela¬ 
tion,  171-174 
Brownian  bridge,  338 
Brownian  motion,  698 
multivariate  — ,  703 

Causality,  41-51,  102-108,  261-262, 
316-321,  441-444 
in  r-th  moment,  579 

—  in  variance,  579-580 
Granger  — ,  41-51 
instantaneous  — ,  41-51 
multi-step  ,  41-51 
Wold  — ,  359 

CCC  GARCH  model,  568 
Central  limit  theorem,  689-692 

—  for  martingale  difference  arrays, 
691 

—  for  stationary  processes,  691 
Donsker’s  — ,  699 
functional  — ,  699 
invariance  principle,  699 
Lindeberg-Levy  — ,  691 

Chain  rule  for  vector  differentiation, 
665 

Characteristic 

determinant,  652 
—  value  of  a  matrix,  652

—  vector  of  a  matrix,  652 
polynomial 

—  of  a  matrix,  652 
Chebyshev’s  inequality,  689 
Chebyshev’s  theorem,  690 
Checking  the  adequacy 

—  of  VAR  models,  157-189 

—  of  VARMA  models,  508-510 

—  of  cointegrated  systems,  344-351 

—  of  dynamic  SEMs,  400-401 

—  of  subset  VAR  models,  212-217 
Checking  the  whiteness 

—  of  VAR  residuals,  157-174 

—  of  VARMA  residuals,  510 
Chi-square  distribution,  678 

noncentral  — ,  679 
Choleski  decomposition,  362 

—  of  a  positive  definite  matrix,  659 
Chow  forecast  test,  183 

Chow  test,  182 

Closed-loop  control  strategy,  388 
CLT,  689 

Cofactor  of  an  element  of  a  square 
matrix,  648 

Cointegrated  process,  244-256 

—  of  order  ( d,b ),  245 
Cointegrated  system,  244-256 

Granger  representation  theorem  for 
— ,  251 

vector  error  correction  model  of  — , 
244-256 

Cointegrated  VAR  process,  256 
checking  the  adequacy  of  — ,  344-351 
estimation  of  — ,  269-309 
forecasting  of  -,  258-261,  315-316 
GLS  estimation  of  — ,  291-294 
Granger-causality  in  — ,  261,  316 
impulse  response  analysis  of  — , 
262-264,  321-322 
least  squares  estimation  of  — , 
286-291 

LS  estimation  of — ,  286-291 
ML  estimation  of  — ,  294-300 
structural  analysis  of  — ,  261-264,
316-322 

two-stage  estimation  of — ,  301-302 
Cointegrated  VARMA  process,  515-521 
ARMA_RE  form  of  — ,  518-519


EC-ARMA_RE  form  of  — ,  519-521
error  correction  echelon  form  of  — , 
519-521 

estimation  of — ,  521-522 
reverse  echelon  form  of — ,  518-519 
specification  of  cointegrating  rank  of 
— ,  525-526 
Cointegrating 

—  matrix,  256 

—  vector,  245 

Cointegrating  rank,  see  cointegration 
rank 

Cointegration  matrix,  248 
Cointegration  rank,  248 

—  of  VAR  process,  248 

—  of  VARMA  process,  525-526 
LR  test  for  — ,  327-335,  551-552 
maximum  eigenvalue  test  for  — ,  329 
testing  for  — ,  327-343,  551-552 
trace  test  for  — ,  329 

Column  vector,  645 
Common  trend,  245 
Commutation  matrix,  663 
Complex  matrix,  657 
Complex  number 
modulus  of  — ,  652 
Conditional  forecast,  402 
Conditional  likelihood  function,  464 
Conditional  model,  387 
Conditional  moment  profiles,  580-582 
Confidence  interval 

—  for  forecast  error  variance 
components,  114 

—  for  impulse  responses,  112 
Hall’s  percentile  — ,  710 
Hall’s  studentized  — ,  710 
standard  percentile  — ,  710 

Consistency 

super  — ,  288,  301 
Consistent  estimation 

—  of  Kronecker  indices,  501 
—  of  VAR  order,  148-150,  326

—  of  white  noise  covariance  matrix, 
76 

Constant  conditional  correlation 
GARCH  model,  568 
Constrained  VAR  models 
linear  constraints,  194-221 
nonlinear  constraints,  221-222 
Contemporaneous  aggregation  of
VARMA  process,  440 
Continuous  mapping  theorem,  699 
Convergence 

—  almost  surely,  682

—  in  distribution,  682 

—  in  law,  682 

—  in  mean  square  error,  682 

—  in  probability,  681 

—  in  quadratic  mean,  682 
—  with  probability  one,  682

strong  — ,  682 
weak  — ,  682 
Cramér-Wold  device,  691

Data  generation  process,  4 
DCC  GARCH  model,  568 
Decomposition  of  matrices,  656-659 
Choleski  — ,  659 
Jordan  — ,  656-657 
Definite 

—  matrix,  655-656 

—  quadratic  form,  655-656 
Degree 

McMillan  — ,  453 
Degrees  of  freedom,  678 
Determinant  of  a  matrix,  647 
Deterministic  trend,  238 
DGP,  4 

Diagnostic  checking 

—  of  VAR  models,  157-189 

—  of  VARMA  models,  508-510 

—  of  cointegrated  systems,  344-351 

—  of  dynamic  SEMs,  400-401 

—  of  restricted  VAR  models,  212-214 
Diagonal  matrix,  646 
Diagonalization  of  a  matrix,  657-658 
Dickey-Fuller  test,  700 

Difference  operator,  242 
Differencing,  242 

Differentiation  of  vectors  and  matrices, 
664-671 

Direction  matrix,  471 
Discrete  stochastic  process,  3 
Distributed  lag  model,  387,  391-392 
rational  — ,  391-392 
Distribution 

—  multivariate  normal,  677-678 

—  normal,  677-678 


—  of  quadratic  form,  678 
chi-square  — ,  678 

F  — ,  679 

noncentral  F  — ,  680 
noncentral  chi-square  — ,  679 
posterior  — ,  222 
prior  — ,  222 
Distribution  function,  3 
joint  — ,  3 

Donsker’s  theorem,  699,  704 
Drift  of  a  random  walk,  238 
Dummy  variable,  585 
seasonal  — ,  585 
Duplication  matrix,  662 
Dynamic 

—  MIMIC  model,  621 

—  factor  analytic  model,  620 
—  multipliers,  392

Dynamic  conditional  correlation 
GARCH  model,  568 
Dynamic  SEM 

checking  the  adequacy  of  — ,  400-401 
estimation  of  — ,  394-400 
final  equations  of  — ,  392
final  form  of  — ,  391 
forecasting  of  — ,  401-406 
conditional  — ,  402 
unconditional  — ,  402 
multipliers  of  — ,  406-408 
optimal  control  of  — ,  408-411 
rational  expectations  in  — ,  392-394 
reduced  form  of  — ,  390 
specification  of  — ,  400-401
structural  form  of  — ,  390 

EC-ARMA_RE  form

—  of  VARMA  process,  519-521 
estimation  of  — ,  522 
specification  of  — ,  523-526 

Echelon  form 

—  VARMA  representation,  452-453 

—  of  a  VARMA  process,  452-453 
specification  of  — ,  498-507 

Effect  of  linear  transformations 

—  on  MA  process,  435 

—  on  VARMA  orders,  436 

—  on  forecast  efficiency,  439 
Efficiency 

—  of  estimators,  198-200 
—  of  forecasts,  439
EGARCH  model,  568 
EGLS  estimation 

—  of  parameters  arranged  equation- 
wise,  200-201 

asymptotic  properties  of  — ,  197-200 
implied  restricted  — ,  197 
restricted  — ,  197-201 
Eigenvalue  of  a  matrix,  652 
Eigenvector  of  a  matrix,  652 
Elimination  matrix,  662 
EM  algorithm,  635 
Empirical  distribution,  707 
generation  of  — ,  707 
Endogenous  variable,  387-390,  613 
Equilibrium,  244-245 
Equilibrium  relation,  245 
Error  correction  echelon  form,  519-521 
estimation  of  — ,  522 
specification  of  — ,  523-526 
Error  correction  model,  246 
Error  process 

—  of  the  measurement  equation  of  a 
state  space  model,  613 

—  of  the  transition  equation  of  a 
state  space  model,  613 

Estimated  generalized  least  squares 
estimation,  see  EGLS  estimation 
Estimation 

—  of  AB-model,  372-375 

—  of  ARMA_RE  form,  521-522

—  of  Blanchard-Quah  model,  376 

—  of  SVAR,  372-376 

—  of  SVAR  with  long-run  restrictions, 
376 

—  of  SVECM,  376-377 

—  of  VAR  models,  69-93,  531-536 

—  of  VARMA  models,  447-487 

—  of  autocorrelations,  157-169 

—  of  autocovariances,  157-169 

—  of  cointegrated  VARMA  process, 
521-522 

—  of  cointegrated  systems,  269-309 

—  of  dynamic  SEMs,  394-400 

—  of  error  correction  echelon  form, 
522 

—  of  integrated  VAR  processes, 
309-315 


—  of  multivariate  GARCH  model, 
569-571 

—  of  periodic  models,  594-598 

—  of  process  mean,  83-85 

—  of  reverse  echelon  form,  521-522

—  of  state  space  models,  631-637 

—  of  time  varying  coefficient  models,
589-591 

—  of  white  noise  covariance  matrix, 
75-77,  197-198 

—  with  linear  restrictions,  195-204 

—  with  nonlinear  restrictions,  222 

—  with  unknown  process  mean,  85 
Bayesian  — ,  222-229 

EGLS  — ,  197 

generalized  least  squares  — ,  291-294 
GLS  — ,  195 

least  squares  — ,  286-291 
LS  — ,  69-82,  197 

maximum  likelihood  — ,  87-93,  200, 
294-300,  589-591 

multivariate  least  squares  — ,  69-82, 
531-536 

preliminary  — ,  474-477 
restricted 
EGLS  — ,  195-200 
GLS  — ,  195-200 
restricted  — ,  195-204,  222 
two-stage  — ,  301-302 
Yule-Walker  — ,  85-86
Exact  likelihood  function,  458-461 
Exogenous  variable,  387-390 
strictly  — ,  389 
strongly  — ,  388 
super  — ,  388 
systems  with  — ,  388-390 
weakly  — ,  388 
Expectation 
rational  — ,  392 

F-distribution,  679 
noncentral  — ,  680
Factor  analytic  model 
dynamic  — ,  620 
Factor  GARCH  model,  567 
Factor  loadings,  620 
FCLT,  699 
Filter 

Kalman  — ,  625-631 
Filtering,  625
Final  equations  form 

—  VARMA  representation,  452 

—  of  a  dynamic  simultaneous 
equations  model,  392 

specification  of  — ,  494-498 
Final  form  of  a  dynamic  simultaneous 
equations  system,  391 
Final  prediction  error  criterion,  see 
FPE  criterion 

Finite  order  MA  process,  420-423 
Forecast 

—  interval,  39-41 

—  of  VAR  process,  93-102,  536-540 

—  of  VARMA  process,  432-434, 
487-490 

—  region,  39-41 
conditional  — ,  402 
estimated  — ,  93-102,  487-490,
536-540

loss  function  for  — ,  32-33 
minimum  MSE  — ,  35-39 
point  — ,  33-39 
unconditional  — ,  402 
Forecast  error,  93-95 
Forecast  error  impulse  responses,  51-56 
Forecast  error  variance  component,  108 
Forecast  error  variance  decomposition, 
63-66,  540-545 

—  of  VAR  process,  63-66 

—  of  cointegrated  system,  264 
asymptotic  distribution  of  — ,
108-118,  205-206,  541-543
structural  — ,  381-382 
Forecast  interval,  98 
Forecast  MSE  matrix,  434 

approximate  — ,  96-98,  489-490,  536 
Forecast  region,  98 
Forecasting 

—  of  ARCH  process,  561-562

—  of  GARCH  model,  561-562 

—  of  VAR  process,  31-41 

—  of  VARMA  process,  432-434 

—  of  cointegrated  system,  258-261, 
315-316 

—  of  dynamic  SEM,  401-406 

—  of  estimated  VAR  process,  93-102, 
204-205 


—  of  estimated  VARMA  process,
487-490 

—  of  infinite  order  VAR  process, 
536-540 

—  of  integrated  system,  258-261, 
315-316 

—  of  restricted  VAR  process,  204-205 
FPE  criterion,  146 

Fully  modified  VAR  estimation,  318 
Functional  central  limit  theorem,  699 
multivariate  — ,  704 

GARCH  model,  557-584 
asymmetric  — ,  568 
CCC  — ,  568 

constant  conditional  correlation  — , 
568 

dynamic  conditional  correlation  — , 
568 

exponential  — ,  568 
factor  — ,  567 

generalized  orthogonal  — ,  568 
interpretation  of  — ,  579-582 
multivariate  — ,  562-584 
univariate  — ,  559-562 
GARCH  process,  557-584 
forecasting  of  — ,  561-562 
Gaussian  likelihood  function 

—  of  MA  process,  458-463 

—  of  VAR  process,  87-89 

—  of  VARMA  process,  463-467 

—  of  cointegrated  process,  294 

—  of  state  space  model,  631-633 
Gaussian  process 

VAR,  16 
VARMA,  423 
white  noise,  75 

Generalized  autoregressive  conditional 
heteroskedasticity,  559 
Generalized  impulse  responses,  580-582 
Generalized  inverse  of  a  matrix,  650 
Generalized  orthogonal  GARCH  model, 
568 

Generating  process  of  a  time  series,  4 
Generation  process  of  a  time  series,  4 
Global  identification,  633-634 
Globally  identified  model,  633-634 
GLS  estimation,  195-200 

—  of  cointegrated  system,  291-294 




asymptotic  properties  of  — ,  197 
Gradient 

—  algorithm,  469

—  of  log-likelihood  function,  635 

—  of  vector  function,  469 
Granger  representation  theorem,  251 
Granger-causality 

—  in  VAR  models,  41-51,  102-104 

—  in  VARMA  models,  441-444 

—  in  cointegrated  system,  261-262 

—  in  cointegrated  systems,  316-321 
characterization  of  — ,  316

lag  augmentation  test  for  — ,  318 
lag  augmented  Wald  test  for  — ,  318 
test  for  — ,  102-104,  316-321 
Wald  test  for  — ,  102-104,  316-321 

Hall’s  percentile  confidence  interval,  710 
Hall’s  studentized  confidence  interval, 
710 

Hannan-Kavalieris  procedure,  503-505 
Hannan-Quinn  criterion,  see  HQ 
criterion 

Hessian  matrix,  665 
HQ  criterion,  150,  208 

Idempotent  matrix,  653 
Identification 

—  of  VARMA  model,  447-458 

—  of  VARX  model,  400 

—  of  dynamic  simultaneous  equations 
system,  400 

—  of  state  space  model,  633-634 
global  — ,  634 

local  — ,  634 

Identification  problem,  447-458 
Identified  model 
globally  — ,  634 
locally  — ,  634 
state  space,  633-634 
VARMA,  447-458 
Identity  matrix,  646 
Impact  multiplier,  61 
Impulse  response  analysis,  377-382 

—  of  VAR  model,  51-63 

—  of  VARMA  model,  444,  490 

—  of  cointegrated  system,  262-264, 
321-322 

Impulse  responses,  51-63,  377-382 


—  of  VARMA  model,  444,  490 

—  of  cointegrated  system,  262-264, 
321-322 

accumulated  — ,  55 
asymptotic  distribution  of  — , 
108-118,  205-206,  541-543 
estimation  of  — ,  108,  205-206 
forecast  error  — ,  51-56 
generalized  — ,  580-582 
orthogonalized  — ,  56-62,  359 
structural  — ,  359,  377-382 
total  — ,  56 
Indefinite 

—  matrix,  656 

—  quadratic  form,  656 
Index  model,  222 

Infinite  order  MA  representation 

—  of  a  VARMA  process,  423 

—  of  a  time  varying  coefficient 
process,  587 

Infinite  order  VAR  representation 

—  of  a  VARMA  process,  425 

—  of  an  MA  process,  420 
Information  matrix 

—  of  VAR  process,  90 

—  of  VARMA  process,  472-474 

—  of  state  space  model,  635 

—  of  time  varying  coefficient  VAR 
model,  591 

Initial 

—  input,  613 

—  state,  613 
Innovations 

structural  — ,  359 
Input 

—  matrix  of  a  state  space  model,  613 

—  variables,  388 
observable  — ,  388,  613 
unobservable  — ,  388 

Inputs  of  a  state  space  model,  613 
Instantaneous  causality 

—  in  VAR  models,  41-51 
tests  for  — ,  104-108 

Instrument 

—  variable,  388,  613 
observable  — ,  388,  613 

Integrated 

—  of  order  d,  242 

—  process,  237-244 
—  time  series,  237-244

—  variable,  237-244 
Integration 

order  of  — ,  242 
Interim  multipliers,  56,  392 
Interpretation 

—  of  ARCH  model,  579-582 

—  of  GARCH  model,  579-582 

—  of  VARMA  model,  441-444 
classical  versus  Bayesian  — ,  228-229 

Interval  forecast,  39-41,  98 
Intervention 

—  in  intercept  model,  604-606 
testing  for  — ,  605-606 

Intervention  model,  586 
estimation  of  — ,  604-608 
specification  of  — ,  604-608 
Invariance  principle,  699 
multivariate  — ,  704 
Inverse  of  a  matrix,  649 
Invertible 

—  MA  operator,  422 
—  MA  process,  420-422 
—  VARMA  process,  425

—  matrix,  649 
IS-LM  model,  366 

Iterative  optimization  algorithm 
EM  algorithm,  635 
Newton  algorithm,  471 
scoring  algorithm,  472,  634-636 

Jarque-Bera  test,  175 
Jordan  canonical  form,  657 

Kalman  filter,  625-631 

—  correction  step,  627 

—  forecasting  step,  627 

—  gain,  627 

—  initialization,  627 

—  prediction  step,  627 
—  recursions,  626-630

—  smoothing  step,  630 
Kalman  gain,  627 

Kalman  smoothing  matrix,  630 
Khinchine’s  theorem,  690 
Kronecker  indices,  453 

—  of  VARMA  process,  453 

—  of  cointegrated  VARMA  process, 
518 


—  of  echelon  form,  453 

—  of  reverse  echelon  form,  518 
determination  of  — ,  498-507 
estimation  of  — ,  498-507 
specification  of  — ,  498-507 

Kronecker  product,  660 
Kurtosis 

asymptotic  distribution  of  — ,  175, 
178 

measure  of  multivariate  — ,  174-180 

Lagrange  function,  671,  695 
Lagrange  multiplier  statistic,  508-510, 
600-601 

asymptotic  distribution  of  — ,  510, 
601 

Lagrange  multiplier  test,  508-510, 
600-601,  694-698 
Lagrange  multipliers,  671 
Law  of  large  numbers,  689-692 

—  for  martingale  difference  arrays, 
690 

—  for  martingale  difference  sequence,
690 

—  for  stationary  processes,  690 
strong  — ,  689 

weak  — ,  689 
Least  squares  estimation 

—  of  VAR  process,  69-82,  531-536 

—  of  cointegrated  VAR  process, 
286-291 

—  with  mean-adjusted  data,  82-85 
asymptotic  properties  of  — ,  72-77,
197-200,  532-533
multivariate  — ,  69-82,  531-536
restricted  — ,  197-200 
small  sample  properties  of  — ,  80-82 
Least  squares  estimator  of  white  noise 
covariance  matrix,  75-77,  535-536 
asymptotic  properties  of  — ,  75, 
535-536 

Left-coprime  operator,  452 
Leptokurtosis,  560 
Leverage  effect,  568 
Likelihood  function,  693 

—  of  MA  process,  458-463 

—  of  VAR  process,  87-89 

—  of  VARMA  process,  463-467 

—  of  cointegrated  process,  294 
—  of  state  space  model,  631-633

—  of  time  varying  coefficient  VAR 
model,  589 

conditional  — ,  464 
Likelihood  ratio  statistic 

asymptotic  distribution  of  — ,  140 
definition  of  — ,  138 
Likelihood  ratio  test,  694-698 

—  for  cointegration  rank,  327-343, 
551-552 

—  of  linear  restrictions,  138-143 

—  of  periodicity,  598 

—  of  varying  coefficients,  595-598 

—  of  zero  restrictions,  138-143 
Lindeberg-Levy  central  limit  theorem,
691

Linear  constraints 

—  for  VAR  coefficients,  194-195
Linear  system,  387 

Linear  transformation 

—  of  MA  process,  435-436 

—  of  VARMA  process,  436-440 

—  of  multivariate  normal  distribution, 
678 

Linearly  dependent  vectors,  652 
Linearly  independent  vectors,  652 
Litterman  prior 

—  for  nonstationary  process,  310-315

—  for  stationary  process,  225-227 
LLN,  689 

LM  test,  695 

—  for  autocorrelation,  171-174 
Loading  matrix,  248 

Locally  identified  model,  634 
Log-likelihood  function,  693 
Lomnicki-Jarque-Bera  test,  175 
Long-run 

—  effect,  392 

—  multiplier,  392 
Loss  function,  32-33 

quadratic  — ,  409 
LR  test,  695 

LS  estimation,  see  least  squares 
estimation 

MA  operator,  422 
MA  process 

autocovariances  of  — ,  422 
finite  order  — ,  420-423 


invertible  — ,  420-422 
likelihood  function  of  — ,  458-463 
MA  representation 

—  of  a  VARMA  process,  423 
canonical  — ,  426 

forecast  error  — ,  426 
prediction  error  — ,  426 
MA  representation  of  VAR  process, 
18-24 

Martingale  difference  array,  689 
law  of  large  numbers  for  — ,  690 
Martingale  difference  sequence,  689 
law  of  large  numbers  for  — ,  690 
vector  — ,  689 
Matrix,  645 

—  addition,  646 
—  differentiation,  664-671

—  multiplication,  646 

—  multiplication  by  a  scalar,  646 

—  operations,  646-647 

—  rules,  645-675 

—  subtraction,  646 
—  operator

left-coprime  — ,  452 
unimodular  — ,  452 
adjoint  of  — ,  649

characteristic  determinant  of  — ,  652 
characteristic  polynomial  of  — ,  652 
characteristic  root  of  — ,  652 
characteristic  value  of  — ,  652 
characteristic  vector  of  — ,  652 
Choleski  decomposition  of  — ,  659 
cofactor  of  an  element  of  — ,  648
column  dimension  of  — ,  645 
commutation  — ,  663 
conformable  — ,  647
decomposition  of  — ,  656-659 
determinant  of  — ,  647 
diagonal  — ,  646 
diagonalization  of  — ,  657-658 
duplication  — ,  662 
eigenvalue  of  — ,  652 
eigenvector  of  — ,  652 
element  of  — ,  645 
elimination  — ,  662 
full  rank  — ,  652 
generalized  inverse  of  — ,  650 
Hessian  — ,  665 
idempotent  — ,  653 
identity  — ,  646

indefinite  — ,  656 

information  — ,  694 

inverse  of  — ,  649 

invertible  — ,  649 

Jordan  canonical  form  of  — ,  657 

lower  triangular  — ,  646 

minor  of  an  element  of  — ,  648 

Moore-Penrose  inverse  of  — ,  650 

negative  definite  — ,  656 

negative  semidefinite  — ,  656

nilpotent  — ,  653 

nonsingular  — ,  649 

null  — ,  646 

orthogonal  — ,  654 

orthogonal  complement  of  — ,  654

partitioned  — ,  659 

positive  definite  — ,  655 

positive  semidefinite  — ,  655 

rank  of  — ,  652 

regular  — ,  649 

row  dimension  of  — ,  645 

square  — ,  645 

square  root  of  — ,  658 

symmetric  — ,  646 

trace  of  — ,  653 

transpose  of  — ,  646 

triangular  — ,  646 

typical  element  of  — ,  645 

unit  — ,  646 

upper  triangular  — ,  646 
zero  — ,  646 

Maximum  eigenvalue  test  for  cointegration
rank,  329

Maximum  likelihood  estimation,  see 
ML  estimation 

McMillan  degree 

—  of  VARMA  process,  453 

—  of  echelon  form,  453 

Mean  squared  error  matrix,  see  MSE 
matrix 

Mean  vector  of  a  VAR  process,  82 

Mean-adjusted 

—  VAR  process,  82

—  process,  82 

Measurement 

—  equation  of  state  space  model,  611, 
613 

—  errors,  613 


—  matrix,  613 
MGARCH,  562-584
MIMIC  models,  621 
Minimization 

—  algorithms,  469-472
iterative  — ,  469-472 
numerical  — ,  469-472 
Minimum  MSE  forecast,  35-39 
Minor  of  an  element  of  a  square  matrix, 
648 

ML  estimates 

computation  of  — ,  89-90,  467-477,
631-637 

ML  estimation,  693 

—  of  AB-model,  372-375

—  of  Blanchard-Quah  model,  376

—  of  SVAR,  372-376 

—  of  SVECM,  376-377 

—  of  VAR  process,  87-93 

—  of  VAR  process  with  time  varying 
coefficients,  589-591 

—  of  VARMA  process,  458-487 

—  of  cointegrated  system,  294-300 

—  of  periodic  VAR  process,  594-598 

—  of  restricted  VAR  process,  200 

—  of  state  space  model,  631-637 
quasi  — ,  140

Model  checking 

—  of  VAR  models,  157-189 

—  of  VARMA  models,  508-510 

—  of  cointegrated  systems,  344-351 

—  of  dynamic  SEMs,  400-401 

—  of  restricted  VAR  models,  212-217 

—  of  state  space  models,  639 

—  of  subset  VAR  models,  212-217 
Model  selection 

—  of  VAR  models,  135-157 

—  of  VARMA  models,  493-508 

—  of  cointegrated  processes,  325-344 

—  of  subset  VAR  models,  206-212 
Model  specification 

—  of  VAR  models,  135-157 

—  of  VARMA  models,  493-508 

—  of  cointegrated  processes,  325-344 

—  of  dynamic  SEMs,  400-401 

—  of  periodic  VAR  models,  594-604 

—  of  subset  VAR  models,  206-212 
Model  specification  criteria 

AIC,  147,  208 
FPE,  146
HQ,  150,  208 
SC,  150,  208 

Modified  portmanteau  statistic,  174, 
214 

approximate  distribution  of  — ,  174, 
214 

Modified  portmanteau  test,  174,  214, 
510 

Modulus  of  a  complex  number,  652 
Moore-Penrose  (generalized)  inverse, 
650 

Moving  average  process,  see  MA  process 
Moving  average  representation  of  VAR 
process,  18-24 
MSE  matrix,  434 

approximate  — ,  96-98,  489-490,  536 
MSE  of  forecast,  96-98,  434,  489-490, 
536 

Multi-step  causality 

—  in  VAR  models,  41-51 
tests  for  — ,  105-108 

Multiplicative  operator,  221-222 
Multiplier 

—  analysis,  392,  406-408 
dynamic  — ,  392 
impact  — ,  61 

interim  — ,  392 
long-run  — ,  392 
total  — ,  392 

Multivariate  ARCH  model,  563-564 
interpretation  of  — ,  579-582 
Multivariate  Beveridge-Nelson
decomposition,  252

Multivariate  GARCH  model,  562-584 
BEKK,  565-567 
estimation  of  — ,  569-571 
interpretation  of  — ,  579-582 
Multivariate  least  squares  estimation, 
69-86 

—  of  VAR  process,  69-82 

—  of  infinite  order  VAR  process, 
531-536 

Multivariate  normal  distribution, 
677-678 

linear  transformation  of  — ,  678 
Multivariate  stochastic  process 
discrete  — ,  3 


Negative  definite 

—  matrix,  656 

—  quadratic  form,  656 
Negative  semidefinite  matrix,  656 
Newton  algorithm,  471 
Newton-Raphson  algorithm,  471 
Nilpotent  matrix,  653 
Noncentral  F-distribution,  680 
Noncentral  chi-square  distribution,  679 
Noncentrality  parameter,  679 
Nonlinear 

—  parameter  restrictions,  221-222 

—  state  space  model,  623-625 
Nonnormality 

tests  for  — ,  174-180 
Nonsingular  matrix,  649 
Nonstationary 

—  VAR  process,  242,  256,  585-586

—  process,  237,  585-586,  614, 
621-623 

—  time  series,  237 
Normal  distribution 

—  multivariate,  677-678 
Normal  equations 

—  for  VAR  coefficient  estimates,  71 

—  for  VAR  process  with  time  varying 
coefficients,  589 

—  for  VARMA  estimation,  467-469 
Normal  prior,  222-225 

—  p.d.f.,  222 

Normal  prior  for  Gaussian  VAR  process, 
222-225,  309 

Observable 

—  input,  388

—  output,  388 

—  variables,  388 
Observation 

—  equation  of  state  space  model,  611, 
613 

—  error,  611,  613 

—  noise,  613 
Open-loop  strategy,  411 
Operator 

left-coprime  — ,  452 
MA  — ,  422 
unimodular  — ,  452 
Optimal  control,  408-411 
closed-loop  — ,  411
open-loop  — ,  411
problem  of  — ,  410
Optimization 

—  algorithms,  469-472

—  of  vector  functions,  671-675 
Order  determination 

—  for  VAR  process,  135-157 

—  for  cointegrated  process,  325-327 
criteria  for  — ,  146-157 

tests  for  — ,  136-145 
Order  estimation 

—  for  cointegrated  processes,  325-327

—  of  VAR  process,  146-157 
consistent  — ,  148-150
criteria  for  — ,  146-157 

Order  in  probability,  684-685 
Order  of 

—  MA  process,  420 
—  VAR  process,  136
—  VARMA  process,  423 
Orthogonal 

—  matrix,  654 

—  vectors,  654 

Orthogonal  complement  of  a  matrix, 
654 

Orthogonalized  impulse  responses, 
56-62 

accumulated  — ,  108 
Orthonormal  vectors,  654 
Outlier,  609 
Output 

—  of  a  state  space  system,  613 
observable  — ,  388 

Partial  model,  387 
Partitioned  matrix,  659 
rules  for  — ,  659-660 
Period  of  a  stochastic  process,  591 
Periodic  VAR  process 

definition  of  — ,  591-594
estimation  of  — ,  594-598
specification  of  — ,  594-604
Permanent  shock,  369 
Point  forecast,  33-39 
Policy 

—  simulation,  406 

—  variable,  613 

Portmanteau  statistic,  169-171,  214, 
510 


approximate  distribution  of  — ,  169, 
214,  510 

modified  — ,  171,  214,  510 
Portmanteau  test,  169-171,  214,  510 
modified  — ,  171,  214,  510 
Positive  definite 

—  matrix,  655 

—  quadratic  form,  655 
Positive  semidefinite  matrix,  655 
Poskitt’s  procedure,  505-507 
Posterior 

—  density,  222 

—  mean,  222 

—  p.d.f.,  222 
Postmultiplication,  647 
Predetermined  variable,  388 
Prediction  tests  for  structural  change 

—  based  on  one  forecast,  184-186

—  based  on  several  forecasts,  186-188

—  for  cointegrated  systems,  349-351 

—  of  VAR  processes,  184-189 

—  of  VARMA  processes,  510 
Preliminary  estimation  of  VARMA
process,  474-477

Preliminary  estimator  of  VARMA 
process,  475-477 
Premultiplication,  647 
Probability  space,  2 
Process 

cointegrated  — ,  244-256 
invertible  MA  — ,  420-422 
invertible  VARMA  — ,  425 
periodic  — ,  586 
stable  VARMA  — ,  423 
VAR  — ,  5 

VARMA  — ,  423-426 
Product  rule  for  vector  differentiation, 
665 

Pure  MA  representation  of  a  VARMA 
process,  423 

Pure  VAR  representation  of  a  VARMA 
process,  425 

Quadratic  form,  655 
distribution  of  — ,  678 
indefinite  — ,  656 
independence  of  — ,  679 
negative  definite  — ,  656 
negative  semidefinite  — ,  656 
positive  definite  — ,  655
positive  semidefinite  — ,  655 
Quasi  ML  estimator,  140 

Random  coefficient  VARX  model, 
621-623 

Random  variable,  2 
Random  vector,  3 
Random  walk,  237 
Random  walk  with  drift,  238 
Rank  of  a  matrix,  652 
Rank  of  cointegration 

LR  test  for  — ,  327-343,  551-552 
testing  for  — ,  327-343,  551-552 
Rational 

—  distributed  lag  model,  391-392 

—  expectations,  392 

—  transfer  function,  392 

—  transfer  function  model,  392 
Real  matrices,  645 
Recursions 

Kalman  filter  — ,  626-630 
Recursive  computation 

—  of  derivatives,  467-468 

—  of  residuals,  478 

Reduced  form  of  a  dynamic  SEM,  390 
Regular  matrix,  649 
Resampling,  709-712 
Resampling  technique,  709-712 
Residual  autocorrelation 

—  of  VAR  process,  161-169,  212-213

—  of  VARMA  process,  510 
asymptotic  properties  of  — ,  166,
212-213

estimation  of  — ,  161-169,  212-213
Residual  autocovariance 

—  of  VAR  process,  161-169,  212-213

—  of  VARMA  process,  510 
asymptotic  properties  of  — ,  165,
212-213

estimation  of  — ,  161-169,  212-213
Residual  based  bootstrap,  709 
Residuals  of  VAR  process 

checking  the  whiteness  of — ,  157-174, 
214 

Residuals  of  VARMA  process 
checking  the  whiteness  of  — ,  510 
estimation  of  — ,  475 


Restricted  estimation  of  VAR  models, 
195-204 

asymptotic  properties  of — ,  197-201 

EGLS,  197-200 

GLS,  195-197 

LS,  197 

ML,  200 

Restrictions  for  VAR  coefficients 

—  for  individual  equations,  200-201 
linear  — ,  194-195 

nonlinear  — ,  221-222
tests  of  — ,  104-108,  138-143
Wald  test  of  — ,  104-108
zero  — ,  206-212 

Restrictions  for  VARMA  coefficients 
Granger-causality  — ,  441-444 
identifying  — ,  452-454 
linear  — ,  464 
LM  test  of  — ,  508-510 
tests  of  — ,  508-510 
Restrictions  on  white  noise  covariance, 
202-204 

Reverse  echelon  form,  518-519 
estimation  of — ,  521-522 
Row  vector,  645 

Sample 

—  autocorrelations,  159 

—  autocovariances,  157 

—  mean,  83-85 
SC,  150,  208 

Schwarz  criterion,  see  SC 
Score  vector,  694 

Scoring  algorithm,  374,  472,  634-636 
Seasonal 

—  dummies,  585 

—  model,  585 

—  operator,  221 

—  process,  585 

—  time  series,  585 

Second  order  Taylor  expansion,  671 
SEM,  387 

Sequential  elimination  of  regressors 

specification  of  subset  VAR  model, 
211 
Shock 

permanent  — ,  369 
transitory  — ,  369 
Simulation  techniques 
evaluating  properties  of  estimators  by
— ,  707-709 

evaluating  properties  of  test  statistics 
by  — ,  707-709 

Simultaneous  equations  model,  see 
SEM 
Skewness 

asymptotic  distribution  of  — ,  175, 
178 

measure  of  multivariate  — ,  174-180 
Slutsky’s  theorem,  683 
Small  sample  properties 

—  of  LS  estimator,  80-82 

—  of  VAR  order  selection  criteria, 
151-157 

—  of  estimated  forecasts,  100-102 

—  of  estimators,  707-709 

—  of  test  statistics,  707-709 
investigation  of  — ,  80,  707-709

Smoothing,  630 
Smoothing  matrix 
Kalman  — ,  630 
Smoothing  step,  630 
Specification  of 

—  EC-ARMA_RE  form,  523-526

—  VAR  models,  135-157

—  VARMA  models,  493-508 

—  cointegrated  systems,  325-344 

—  dynamic  SEMs,  400-401 

—  echelon  form,  498-507 

—  error  correction  echelon  form, 
523-526 

—  final  equations  form,  494-498 

—  subset  VAR  models,  206-212 
Specification  of  cointegrating  rank 

—  of  EC-ARMA_RE  form,  525-526

—  of  VAR  process,  327-343 

—  of  error  correction  echelon  form, 
525-526 

Square  root  of  a  matrix,  658 
Stability  condition,  15,  16 
Stability  of  a  VARMA  process,  423 
Stable 

—  VAR  process,  13-18
—  VARMA  process,  423 
Standard  percentile  confidence  interval, 
710 

Standard  VARMA  representation,  448 
Standard  white  noise,  73 


State  space  model 

estimation  of  — ,  631-637 
global  identification  of  — ,  634 
identification  of  — ,  633-634 
local  identification  of  — ,  634 
log-likelihood  function  of  — ,  631-633 
ML  estimation  of  — ,  631-637 
nonlinear  — ,  623-625 
State  space  representation 

—  of  VAR  process,  614-616 

—  of  VARMA  process,  616 

—  of  VARX  process,  616 

—  of  VARX  process  with  systematically
varying  coefficients,  621

—  of  factor  analytic  model,  619-621
—  of  random  coefficient  VARX
model,  621-623
State  vector,  611,  613 
Stationarity 

asymptotic  — ,  241 
strict  — ,  24

Stationarity  condition  for  VAR  process, 
25 

Stationary  point  of  a  function,  671 
Stationary  stochastic  process,  24-26 
strictly  — ,  24 

Stationary  VAR  process,  24-26 
Step  direction,  469 
Stochastic  convergence,  681-684 

—  almost  surely,  682 

—  in  distribution,  682 
—  in  law,  682

—  in  mean  square  error,  682 

—  in  probability,  681 

—  in  quadratic  mean,  682

—  with  probability  one,  682 
strong  — ,  682 

weak  — ,  682 
Stochastic  process 

cointegrated  — ,  244-256 
discrete  — ,  3 
MA,  420-423 
multivariate  — ,  3 

nonstationary  — ,  237,  585-586,  614, 
621-623 

periodic  — ,  591-594 
VAR,  13-18 
VARMA,  423-426 
VARX,  387,  616,  621-623
Stochastic  trend,  238 
Stochastic  volatility  model,  583 
Strictly  exogenous  variable,  389 
Strictly  stationary  stochastic  process,
24

Strong  law  of  large  numbers,  689 
Strongly  exogenous  variable,  388 
Structural  analysis 

—  of  VARMA  models,  441-444 

—  of  cointegrated  system,  261-264 

—  of  cointegrated  systems,  316-322 

—  of  dynamic  SEMs,  406-408 

—  of  subset  VAR  models,  205-206,
221 

Structural  change,  182 

Chow  test  for  — ,  182-184,  348-349 
prediction  test  for  — ,  349-351 
testing  for  — ,  182-189,  348-351,  510, 
598-601,  608 
Structural  form 

—  of  a  VAR  process,  358 

—  of  a  dynamic  SEM,  390 
Structural  impulse  responses,  359,
377-382

Structural  innovation,  359 
permanent  — ,  369 
transitory  — ,  369 
Structural  models 
VAR,  357-386 
VECM,  357-386 

Structural  time  series  model,  618-619 
Structural  VAR,  358-368 

—  with  Blanchard-Quah  restrictions, 
367-368 

—  with  long-run  restrictions,  367-368 
AB-model,  364-367 

A-model,  358-362 
B-model,  362-364 

Structural  vector  autoregression,  see 
structural  VAR 

Structural  vector  error  correction 
model,  368-372 
Submatrix,  659 
Subset  model 

bottom-up  procedure  for  — ,  344 
full  search  procedure  for  — ,  344
sequential  elimination  of  regressors, 
344 


top-down  procedure  for  — ,  344 
Subset  VAR  model,  206-221 
checking  of  — ,  212-217 
specification  of  — ,  206-212 
bottom-up  strategy,  211 
sequential  elimination  of  regressors, 
211 

top-down  strategy,  208-210 
structural  analysis  of  — ,  221 
Super-exogenous  variable,  388 
Superconsistent  estimator,  288,  301 
SVAR,  357-368 

—  with  Blanchard-Quah  restrictions, 
367-368 

—  with  long-run  restrictions,  367-368 
AB-model,  364-367 

A-model,  358-362 
B-model,  362-364 
Blanchard-Quah  — ,  367-368 
concentrated  likelihood  function,  373 
estimation  of  — ,  372-376 
ML  estimation  of  — ,  372-376 
SVECM,  368-372 
estimation  of  — ,  376-377 
ML  estimation  of  — ,  376-377 
Symmetric  matrix,  646 
System  equation,  611 
System  matrix,  613 
System  with  exogenous  variables, 
388-412 

Systematic  sampling,  616-618 
Systematically  varying  coefficients 

—  of  VAR  models,  585-589 

—  of  VARX  models,  621 

Taylor  expansion,  671 
second  order  — ,  671
Taylor’s  theorem,  670,  685 
Temporal  aggregation,  434-435, 
440-441,  616-618 
Testing  for 

—  Granger-causality,  102-104, 
316-321 

—  causal  relations,  102-108,  316-321 

—  instantaneous  causality,  104-108 

—  multi-step  causality,  105-108 
—  nonnormality,  174-180

—  periodicity,  598-604 
—  rank  of  cointegration,  327-343,
551-552 

—  structural  change,  181-189,
348-351,  510,  598-601,  608 

—  whiteness  of  residuals,  169-174, 
214,  510 

nonnormality 

—  of  VAR  process,  177-180 

—  of  white  noise  process,  174-177 
residual  autocorrelation 

—  of  VAR  process,  169-174 

—  of  VARMA  process,  510 

—  of  subset  VAR  model,  214 

—  of  white  noise  process,  157-161 
structural  change 

based  on  one  forecast  period, 
182-186 

based  on  several  forecast  periods, 
186-188 

Tests  of  parameter  restrictions 
linear  restrictions,  102,  138-143 
nonlinear  restrictions,  508-510 
Threshold  models,  625 
Time  invariant 

—  autocovariances,  597 

—  coefficients,  596 
Time  series 

nonstationary — ,  237 
seasonal  — ,  585 
Time  varying 

—  coefficients,  585-591 
randomly  — ,  621-623 
systematically  — ,  585-591 

Top-down  strategy  for  subset  VAR 
specification,  208-210 
Total  forecast  error  impulse  responses, 
56 

Total  impact  matrix,  367 
Total  impulse  responses,  56 
Total  multiplier,  392 
Trace  of  a  matrix,  653 
Trace  test  for  cointegration  rank,  329 
Transfer  function,  392 
Transfer  function  model,  387,  392 
rational  — ,  392 
Transformation 

—  of  MA  process,  435-436 

—  of  VARMA  process,  436-440 
linear  — ,  435-440 


Transition  equation 

—  errors,  613 

—  noise,  613 

—  of  a  state  space  model,  611
Transition  matrix,  613 
Transitory  shock,  369 
Transpose  of  a  matrix,  646 
Trend 

deterministic  — ,  238
stochastic  — ,  238 
Triangular  matrix 
lower  — ,  646 
upper  — ,  646 

Triangular  representation  of  cointegrated  system,  251
Two-stage  estimation 

—  of  cointegrated  system,  301-302 
asymptotic  properties  of  — ,  301 

Unconditional  forecast,  402 
Unimodular  operator,  452 
Univariate  ARCH  model,  559-562 
Univariate  GARCH  model,  559-562 
Unmodelled  variable,  387-390 

VAR  estimation 

fully  modified  — ,  318 
VAR  order  estimator 
consistent  — ,  148 
small  sample  properties  of  — , 
151-157 

strongly  consistent  — ,  148 
VAR  order  selection 

AIC  criterion  for  — ,  147 
comparison  of  criteria  for  — ,  150-157 
consistent  — ,  148-150 
criteria  for  — ,  146-157 
FPE  criterion  for  — ,  146
HQ  criterion  for  — ,  150 
SC  criterion  for  — ,  150 
sequence  of  tests  for  — ,  136-145 
testing  scheme  for  — ,  143-144 
VAR  process,  5 

—  with  linear  parameter  restrictions, 
194-221 

—  with  nonlinear  parameter
restrictions,  221-222 

—  with  parameter  constraints, 
193-231 
—  with  time  varying  coefficients,
586-591 

autocorrelations  of  — ,  30-31 
autocovariances  of — ,  21,  26-30 
checking  the  adequacy  of  — ,  157-189 
estimation  of  — ,  69-93,  531-536 
forecast  error  variance  decomposition 
of  — ,  63-66

forecasting  of  — ,  31-41,  93-102, 
536-540 

impulse  response  analysis  of  — , 
108-129,  540-545 
infinite  order  — ,  531-552 
LS  estimation  of  — ,  69-86,  531-536 
MA  representation  of  — ,  18-24 
mean-adjusted  — ,  82 
nonstationary  — ,  256,  586-594 
order  determination  of  — ,  135-157 
order  estimation  of  — ,  146-157